Summary


In recent years, mass spectrometry has developed into the technology of choice
for identifying and quantifying proteins in biological samples, and protein
research is no longer conceivable without it. First steps towards its use in
clinical routine, particularly for personalized medicine, have also been taken.
Because of the complexity and the large volume of data produced by these
measurements, suitable software solutions are required. In particular, the
identification of proteins and peptides from mass spectrometry data is an
important step in the investigation and analysis of biological samples, and it
poses major challenges, above all bioinformatic ones.

This doctoral thesis describes algorithms for the analysis of exactly these
data and shows how taking into account special properties of the latest mass
spectrometry instruments yields substantial improvements and higher
identification rates compared to existing, established software solutions. One
part of this framework is a new database search algorithm, MS Amanda, which
identifies peptides from mass spectrometry data. In addition, algorithms were
developed to process so-called chimeric spectra, i.e., spectra that contain
more than one peptide. This reveals the untapped potential still hidden in
these data. Even in data sets acquired with instrument settings intended to
reduce the occurrence of chimeric spectra, up to 30% of the spectra are
chimeric. This value rises to up to 60% for more complex data sets. Without
additional measurement time, up to 50% additional, previously unidentified
peptides can be detected in such measurements at the same confidence.

All results of this doctoral thesis have been published in recognized
scientific journals, and the algorithms have been made freely available. In
addition, the algorithms have been integrated into various software packages
that offer further analyses of the identified spectra. This has considerably
increased their uptake in the community, which is also reflected in the high
number of citations of the publications and in the download numbers.


Abstract 


Mass spectrometry has emerged as the leading technology for the iden- 
tification and quantification of proteins in biological samples, playing an 
indispensable role in proteomic research. First steps towards clinical routine,
especially in personalized medicine, have also been taken. Due to the
complexity and the amount of generated data, specific software solutions are
necessary to analyze them. In particular, the identification of proteins and
peptides from mass spectrometry data, one of the first but also one of the
most important steps, is a challenging task.

This doctoral thesis describes several algorithms for the analysis of such
data sets, specifically designed to exploit the power of recent developments in
instrument design that yield high-resolution, high-accuracy data. A main part
of this thesis is a new algorithm for database search, MS Amanda, capable of
identifying peptides in mass spectrometry data. Applying MS Amanda leads to a
higher number of identified peptides at the same false discovery rate compared
to established software solutions. Additionally, algorithms have been designed
for the identification of chimeric spectra, i.e., spectra containing more than
a single peptide, revealing a potential that otherwise remains unexploited.
Even for data sets acquired with instrument settings intended to avoid chimeric
spectra, up to 30% of the measured spectra are chimeric, rising to 60% for
complex samples. Up to 50% additional unique peptides that would otherwise have
remained unidentified can be found at no extra measurement time by applying a
chimeric search.

All results of this doctoral thesis have been disseminated through pub- 
lication in internationally renowned journals and presentations at various 
conferences relevant for the proteomics community. All algorithms have 
been made available free of charge and are integrated in various software 
packages, enabling further downstream analyses of the identified spectra. 
These efforts had a great impact on the international awareness of the
algorithms presented in this thesis, as also reflected by the number of
citations and the number of downloads of the software.


Chapter 1


Introduction 


Proteins are essential components of all living cells. The DNA of an organism
is important, but the proteins that are expressed often make the difference.
For example, a caterpillar and the butterfly it later transforms into share
the same DNA; only the expressed proteins change the appearance of the animal.
In addition, the absence or overabundance of certain proteins is often the
trigger of certain illnesses. To understand and investigate certain functions
in a cell, the cause of an illness, or the fundamentals behind the
metamorphosis of a caterpillar into a butterfly, the identification and
subsequent quantification of proteins in a cell are essential.


1.1 Mass Spectrometry 


Mass spectrometry has evolved into an indispensable approach in the analysis
of proteins [2] [4]. Insight into the function, structure, and purpose of a
protein helps to understand the mechanisms in a cell, as proteins are
responsible for almost all tasks in an organism [71]. Mass spectrometry-based
proteomics can provide information on the proteins present in a biological 
sample (e.g., a specific tissue, such as blood, liver, or kidney), on their 
quantities, and on their interaction partners through a large variety of high- 
throughput technologies [19]. 

Mass spectrometers measure the mass-to-charge ratio of ions and their 
abundance and consist of three major parts: 


• Ion source
• Mass analyzer
• Detector


The ion source is responsible for the generation of charged particles, as
only these can be identified by the detector. The mass analyzer separates ions
based on their mass-to-charge ratios, which can then be measured by the
detector. An overview of the different ion sources and mass analyzers can be
found in Figure 1.4; see also Section 1.1.2 for further details. A schematic
workflow of a typical proteomics mass spectrometry experiment (shotgun
proteomics) is shown in Figure 1.1.

Figure 1.1: Schematic workflow of a typical mass spectrometry experiment
taken from Nesvizhskii et al. [72]. Prior to analysis in a mass spectrometer, 
biological samples have to be preprocessed (including enzymatic digestion) 
and separated (based on specific physico-chemical properties). Mass spec- 
trometers generate (tandem) mass spectra of ionized peptides. Generated 
mass spectra have to be subsequently analyzed and interpreted, most often 
by using database search approaches. 


1.1.1 Sample Preparation and Separation 


Prior to the analysis of biological samples in a mass spectrometer in so-called
bottom-up proteomics experiments, sample preparation is necessary. The first
steps include the denaturation of the protein's 3D structure, for example by
heating the protein or breaking the disulfide bonds between cysteines that
stabilize the protein's structure. Subsequently, proteins are proteolytically
digested, i.e., broken into smaller parts, so-called peptides, by specific
enzymes. These enzymes cut proteins at specific cleavage sites, either before
or after specific amino acid patterns. Trypsin, for example, is a very
commonly used enzyme that cuts after lysine (K) and arginine (R), except if
they are followed by a proline (P).

Depending on the cleavage pattern and on the average occurrence of these
amino acids in a protein, certain enzymes produce longer peptides than others.
In addition, some enzymes digest more efficiently than others. No enzyme cuts
at every possible site, which leads to so-called missed cleavages, where a
cleavage site is left uncut; see Figure 1.2 for an example.

Figure 1.2: Potential cleavage sites of peptide
MANPAKSLVDISLRDPAGINTYGQVYKGRHVKTGQRPLAA using trypsin and considering one
missed cleavage. Trypsin cleaves after lysine (K) or arginine (R).
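
The digestion rule described above can be illustrated with a minimal sketch.
It assumes the simple rule stated in the text (cleavage after K or R unless
the next residue is P) and treats a missed cleavage as two neighbouring pieces
that stay joined; the function name and its parameters are chosen for
illustration only and do not belong to any particular tool.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return tryptic peptides, allowing up to `missed_cleavages` uncut sites."""
    # Cleavage positions: after K or R, unless the next residue is P.
    sites = [i + 1 for i in range(len(protein) - 1)
             if protein[i] in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        # Joining n + 1 neighbouring pieces corresponds to n missed cleavages.
        last = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(protein[bounds[start]:bounds[end]])
    return peptides


SEQUENCE = "MANPAKSLVDISLRDPAGINTYGQVYKGRHVKTGQRPLAA"
print(tryptic_digest(SEQUENCE))                           # fully cleaved peptides
print(len(tryptic_digest(SEQUENCE, missed_cleavages=1)))  # including one missed cleavage
```

Because of the proline rule, the arginine in "TGQRPLAA" is not treated as a
cleavage site in this sketch.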


Resulting peptides are usually further separated through liquid
chromatography (LC) or high performance liquid chromatography (HPLC) according
to specific physico-chemical properties (e.g., hydrophobicity) of the peptide
and subsequently analyzed in a mass spectrometer. Separating peptides prior to
the mass spectrometry analysis prevents thousands of peptides from entering the
mass spectrometer at the same time, which would make it impossible to analyze
the sample [60].


1.1.2 Mass Spectrometry 


Mass spectrometers utilized in the field of proteomics measure the
mass-to-charge ratio (m/z) and the abundance of molecules in a sample. First,
peptides are ionized in the ion source, then separated and measured in the
mass analyzer, and finally the number of ions at each m/z value is determined
in the detector.

The two techniques most commonly applied to ionize peptides are
matrix-assisted laser desorption/ionization (MALDI) [48] and electrospray
ionization (ESI) [28]. In terms of mass analyzers, a wide variety of methods
exists, including time-of-flight (TOF), ion trap, and Fourier transform ion
cyclotron resonance. An overview of the different methods can be found in
Figure 1.4, where the upper two graphs show the two ionization techniques and
the lower graphs explain the principles of the analyzers. The instrument
measures the m/z values of the ionized peptides (referred to as precursor
ions), resulting in so-called "MS1 spectra" (see Figure 1.3) [79].

Figure 1.3: MS1 spectrum of a human cancer cell (HeLa) sample, measured on a
Thermo Fisher Q Exactive, as shown by the Xcalibur™ software. Each peak
represents a peptide or noise.
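
The relation between the mass of a peptide, its charge state, and the measured
m/z value can be sketched as follows. The sketch assumes that each charge is
carried by one additional proton; the peptide mass used in the example is a
hypothetical value, and the function names are illustrative.

```python
PROTON_MASS = 1.007276  # Da; assumed charge carrier in this sketch


def mass_to_mz(neutral_mass, charge):
    """m/z of a peptide that carries `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def mz_to_mass(mz, charge):
    """Neutral mass recovered from an observed m/z and a known charge state."""
    return (mz - PROTON_MASS) * charge


# The same (hypothetical) peptide appears at different m/z values per charge state.
for z in (1, 2, 3):
    print(z, round(mass_to_mz(1502.785, z), 4))
```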


1.1.3 Tandem Mass Spectrometry 


Inferring a peptide's sequence from its m/z value alone (so-called "peptide
mass fingerprinting") and subsequently mapping it to a protein is a rather
challenging task when analyzing complex protein samples. A given measured mass
of a peptide could be explained by hundreds or thousands of different
combinations of amino acids, so these techniques may lead to ambiguous matches
[1]. Alternatively, tandem mass spectrometry can be applied, where peptides are
further fragmented into smaller ions that retain sequence-specific information
[74]. In data-dependent acquisition (DDA), which is still the most frequently
used approach, the N most intense peaks in the MS1 spectrum are selected for
further fragmentation, each precursor ion individually [57]. This is in
contrast to data-independent acquisition (DIA), where all precursors in a
certain mass range are selected together for fragmentation [84].
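
The top-N selection in DDA can be illustrated with a few lines of code. This
is a deliberately reduced sketch: it only ranks peaks by intensity, whereas
real acquisition software applies further instrument-specific criteria that
are not modelled here. The peak list is hypothetical.

```python
def select_top_n(ms1_peaks, n):
    """Return the n most intense (mz, intensity) peaks, most intense first."""
    return sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)[:n]


# Hypothetical MS1 peak list: (m/z, intensity)
ms1 = [(445.12, 1.0e5), (523.77, 8.2e6), (611.30, 3.1e6),
       (702.41, 4.0e4), (815.93, 5.5e6)]

for mz, intensity in select_top_n(ms1, 3):
    print(f"select precursor at m/z {mz} for fragmentation")
```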

Figure 1.4: Overview of mass spectrometer types used in proteomics
experiments, taken from Aebersold et al. [2]. The top two illustrations show
the two most common ionization methods: electrospray ionization (ESI, upper
left) and matrix-assisted laser desorption/ionization (MALDI, upper right).
Parts a-f show various configurations of mass spectrometer instruments,
describing the principles of time-of-flight (TOF) and various types of ion trap
instruments. Further instrument details can be found in Aebersold et al. [2].


Here, a certain amount of the peptide ion of interest is collected and
subsequently fragmented. Fragmentation can be achieved by collision with an
inert gas in the so-called "collision cell" (see Figure 1.4) [60]. Depending
on the type of the collision cell, various fragmentation types can occur, as
peptides may break at different positions (Figure 1.5), but will not
necessarily break at every position. The nomenclature of those ions includes a,
b, and c ions for n-terminal fragments and x, y, and z ions for fragments
containing the c-terminus of the peptide [7] [77]. Typically, in CID and HCD
instruments b and y ions are generated, whereas in ETD spectra mainly c and z
ions occur. Recent developments also allow for double fragmentation, resulting
in so-called EThcD spectra [32].

The resulting spectra are so-called MS2 or MS/MS spectra (see Figure 1.6),
which contain peaks of peptide fragments [60]. In DDA experiments, these
spectra are often assumed to contain peaks of only one peptide to ease
spectrum identification; see Section 1.5 for a detailed discussion.


1.2 Identification of MS/MS Spectra 


The interpretation of MS2 spectra is a challenging but essential task, as the
peptides identified in a biological sample provide further insight into the
functionality and the underlying biological processes. As mentioned before and
shown in Figure 1.5, the breakpoints of the peptide are instrument-dependent.
Ions a, b, and c are n-terminal ions, starting at the beginning of the peptide,
while x, y, and z ions start at the c-terminus of the peptide. Labels of
fragment ions also carry a number, accounting for the number of amino acids in
the fragment ion, e.g., a y2 ion contains the two c-terminal amino acids. The
masses of all possible singly charged fragment ions can be calculated using the
following formulas, where k is the number of amino acids in the fragment ion,
the sum runs over these k amino acids, and p+ denotes a proton:


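\begin{align}
\mathrm{mass}(a_k) &= \sum_{i=1}^{k} \mathrm{mass}(AA_i) - \mathrm{mass}(\mathrm{CO}) + \mathrm{mass}(p^+) \tag{1.1} \\
\mathrm{mass}(b_k) &= \sum_{i=1}^{k} \mathrm{mass}(AA_i) + \mathrm{mass}(p^+) \tag{1.2} \\
\mathrm{mass}(c_k) &= \sum_{i=1}^{k} \mathrm{mass}(AA_i) + \mathrm{mass}(\mathrm{NH_3}) + \mathrm{mass}(p^+) \tag{1.3} \\
\mathrm{mass}(x_k) &= \sum_{i=1}^{k} \mathrm{mass}(AA_i) + \mathrm{mass}(\mathrm{CO_2}) + \mathrm{mass}(p^+) \tag{1.4} \\
\mathrm{mass}(y_k) &= \sum_{i=1}^{k} \mathrm{mass}(AA_i) + \mathrm{mass}(\mathrm{H_2O}) + \mathrm{mass}(p^+) \tag{1.5} \\
\mathrm{mass}(z_k) &= \sum_{i=1}^{k} \mathrm{mass}(AA_i) + \mathrm{mass}(\mathrm{O}) - \mathrm{mass}(\mathrm{NH}) + \mathrm{mass}(p^+) \tag{1.6}
\end{align}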


An example of all potential b and y ions of the peptide GISHVIVDEIHER can be
found in Table 1.1.
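
The formulas for b and y ions can be turned into a short calculation. The
following sketch uses standard monoisotopic residue masses rounded to five
decimals (an assumption of this example; the exact constants used to generate
Table 1.1 are not stated), so the last digit may differ slightly from the
table. The other ion types follow the same pattern with their respective mass
offsets.

```python
# Monoisotopic residue masses in Da (standard values, rounded; assumed here).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass(p+)
WATER = 18.01056   # mass(H2O)


def b_ions(peptide):
    """Singly charged b1 .. b(n-1): n-terminal residue sum plus a proton."""
    total, ions = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(total + PROTON)
    return ions


def y_ions(peptide):
    """Singly charged y1 .. y(n-1): c-terminal residue sum plus water and a proton."""
    total, ions = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ions.append(total + WATER + PROTON)
    return ions


peptide = "GISHVIVDEIHER"
for k, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{k} = {b:9.3f}    y{k} = {y:9.3f}")
```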

Usually, this information is used in bioinformatics tools that identify the
peptide represented by a given MS2 spectrum in order to infer the corresponding
protein thereafter. A perfect CID spectrum of a peptide would, for example,
contain all possible a, b, and y ions of the peptide and no other peaks.
Unfortunately, such a spectrum rarely exists for several reasons, including the
necessity of having at least one charge attached to each fragment for it to be
detected. Some amino acids are more likely to carry a charge than others [75],
leading to peptide sequence-dependent mass spectra.




Figure 1.5: Peptide fragmentation types in tandem mass spectrometry [77] 
[7]. a, b, and c ions contain the peptide n-terminus, whereas x, y, and z 
ions include the c-terminus. Numbers indicate the number of amino acids 
included in the fragment. 


Depending on the prior knowledge used to interpret the spectrum, three
approaches currently exist:


• De novo identification
• Spectral library identification
• Database search


All approaches try to find the peptide that best explains the underlying
spectrum, subject to several user-definable settings, such as mass tolerance
or considered modifications (see Figure 1.7).


1.2.1 De Novo Identification 


When applying de novo identification (also called de novo sequencing) to
identify tandem mass spectra, no prior knowledge is required, as only the
information present in the spectrum is used to interpret it. De novo
identification searches for mass differences corresponding to amino acids
between (high-intensity) peaks in the spectrum, leading to so-called sequence
tags [61], i.e., consecutive peaks representing the ions of the underlying
peptide (see Figure 1.7).

Figure 1.6: MS2 spectrum of a human cancer cell (HeLa) sample, measured on a
Thermo Fisher Q Exactive, as shown by the Xcalibur™ software. Each peak
represents a fragment ion or noise.


A typical de novo workflow to identify a peptide in a CID spectrum 
would include the following steps: 


• Estimate peptide length
• Identify a2/b2 ion pair and n-terminus of peptide
• Identify c-terminus of peptide
• Identify peaks with amino acid distances
• Complete y series
• Try to complete b series
• Verify peptide mass
• Check for unexplained high peaks


The length of the peptide sequence can be estimated by dividing the mass of
the precursor by the average molecular weight of an amino acid, which is about
110 Dalton (Da). The a2/b2 ion pair is often very prominent in CID spectra.
a_i and b_i ions are n-terminal ions containing i amino acids and differing by
a mass of about 28 Da, the mass of CO (see Equations 1.1 and 1.2).
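
Two of the workflow steps, estimating the peptide length and identifying peaks
with amino acid distances, can be sketched in a few lines. The example assumes
a clean ladder of consecutive fragment peaks and a fixed tolerance of 0.02 Da;
both are simplifications, and real de novo algorithms have to cope with
missing and spurious peaks. Leucine and isoleucine have identical masses and
are therefore reported together. The peak values are the y1 to y4 ions from
Table 1.1.

```python
# Monoisotopic residue masses in Da (standard values, rounded; assumed here).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "I/L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}
AVERAGE_RESIDUE_MASS = 110.0  # Da, rough average used for the length estimate


def estimate_length(precursor_mass):
    """Rough number of amino acids in a peptide of the given mass."""
    return round(precursor_mass / AVERAGE_RESIDUE_MASS)


def residue_for(delta, tol=0.02):
    """Name of the residue whose mass matches `delta`, or None."""
    for name, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tol:
            return name
    return None


def read_ladder(peak_mzs, tol=0.02):
    """Translate the gaps between consecutive peaks into residues."""
    peaks = sorted(peak_mzs)
    return [residue_for(hi - lo, tol) for lo, hi in zip(peaks, peaks[1:])]


print(estimate_length(1502.785))                        # rough length estimate
print(read_ladder([175.119, 304.162, 441.220, 554.305]))  # gaps between y1..y4
```

Since y ions start at the c-terminus, reading the ladder from low to high
mass yields the sequence in reverse order.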

In previous years, de novo identification has often been used for manual 
spectrum interpretation [59]. Nowadays, several algorithms exist that are 
able to perform automatic de novo sequencing or sequence tag identifica- 
tion on tandem mass spectra, including SHERENGA [15], Lutefisk [44], 
MSNovo [65], pNovo [93], GutenTag [81], DirecTag [82], or PepNovo [30]. 
Recent developments also allow for the identification of chimeric spectra (see
Section 1.5 for further information) [37].


b+          Sequence    y+
58.029      G
171.113     I           1446.770
258.145     S           1333.685
395.204     H           1246.654
494.272     V           1109.595
607.356     I           1010.527
706.425     V           897.442
821.452     D           798.374
950.494     E           683.347
1063.578    I           554.305
1200.637    H           441.220
1329.680    E           304.162
            R           175.119


Table 1.1: All possible b and y ions of peptide GISHVIVDEIHER. 


Performing de novo sequencing is, however, a challenging task, as it is
computationally very expensive and high-resolution MS/MS spectra are necessary
to obtain good results [72]. Therefore, this approach is rarely used for
analyzing standard proteomics data sets. Still, there are certain cases where
de novo identification is of great value, such as when investigating unknown or
poorly studied species [69]. It can also be used for detecting unknown
post-translational modifications (PTMs) or validating results obtained by
database search (see Section 1.2.3) [86].


1.2.2 Spectral Libraries 


An emerging field in peptide identification is the so-called spectral library
search. Here, query spectra are compared to libraries of already identified,
experimentally measured mass spectra (see Figure 1.8). This approach has
several advantages, as these libraries contain only detectable peptides, and
their spectra include intensity information and peaks of non-standard ions.
Searches performed with spectral library search engines can therefore yield
better results at lower runtime than all other approaches [96], provided an
appropriate library is available. This already indicates the drawback of this
approach: only peptides that are included in the library can be identified.
Recent developments show an increased effort in generating such spectral
libraries, such as those of the National Institute of Standards and Technology
(NIST, http://peptide.nist.gov/) or the PROPEL library [97], and in developing
tools to create custom libraries [56]. Although these libraries only cover some
of the standard organisms normally used in research, the field is moving in
this direction. Several algorithms that enable spectral library searching have
been developed, including SpectraST [55], X!Hunter [14], Bibliospec [33],
MSPepSearch [68], M-Split [89], and Pepitome [16]. However, most of them
provide only limited benefits in a daily routine due to missing maintenance,
missing file format support, or the expert or programming knowledge required to
run the tool [39].

Figure 1.7: Peptide identification approaches currently available, taken from
Nesvizhskii et al. [72].
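
The comparison of a query spectrum to library spectra can be illustrated with
a simple similarity measure. The sketch below uses the cosine (normalized dot
product) of binned spectra, which is assumed here purely for illustration and
is not attributed to any of the tools named above. The bin width, the function
names, and the tiny library are hypothetical.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum the intensities of (mz, intensity) peaks per m/z bin."""
    vector = defaultdict(float)
    for mz, intensity in peaks:
        vector[int(mz / bin_width)] += intensity
    return vector


def cosine_similarity(query, reference, bin_width=1.0):
    """Normalized dot product of two binned spectra (0 = disjoint, 1 = identical)."""
    q, r = binned(query, bin_width), binned(reference, bin_width)
    dot = sum(value * r[key] for key, value in q.items() if key in r)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0


def best_library_match(query, library):
    """Return the library peptide whose spectrum is most similar to the query."""
    return max(library, key=lambda pep: cosine_similarity(query, library[pep]))


# Hypothetical library: peptide -> list of (m/z, intensity)
library = {
    "PEPTIDEA": [(175.1, 50.0), (304.2, 100.0), (441.2, 80.0)],
    "PEPTIDEB": [(147.1, 90.0), (260.2, 100.0), (389.2, 30.0)],
}
query = [(175.1, 45.0), (304.2, 95.0), (441.2, 70.0), (500.3, 5.0)]
print(best_library_match(query, library))
```

In contrast to the fixed-intensity theoretical spectra used in database
search, the intensities stored in the library contribute directly to the
similarity.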


1.2.3 Database Search 


Comparing tandem mass spectra to a database of known proteins, the so-called
database search (see Figure 1.9), is the most widely used approach in bottom-up
proteomics experiments [53]. Here, a list of known proteins of the organism of
interest is digested in silico using the same enzyme as in the sample
preparation step (see Section 1.1.1), leading to a list of peptides with known
masses. For each MS2 spectrum, all candidate peptides within a certain mass
range around the precursor mass (i.e., the mass of the complete peptide
measured in the spectrum) are collected from the database. Subsequently, a
theoretical spectrum is calculated for each candidate, considering mass
spectrometry-specific fragmentation patterns: as discussed, depending on the
collision cell used for fragmentation, specific ion types are more or less
likely to occur in the MS2 spectrum (e.g., b and y ions for HCD and CID
spectra, or c and z ions for ETD spectra; see Section 1.1.3). All potential
ions of a specific peptide are calculated according to Equations 1.1 to 1.6
(here given as singly charged ions), constituting the so-called theoretical
spectrum.

Figure 1.8: Spectral library search principle, taken from Li et al. [58].
Previously acquired experimental spectra of known peptides are compared to
target spectra.

All theoretical spectra are compared to the experimentally determined MS2
spectrum, and a score reflecting the quality of each match is calculated. To
date, more than 40 search engines have been developed and published [87],
differing mainly in how they score peptide spectrum matches. These include
pioneers of database search such as SEQUEST or Mascot [76], developed more than
two decades ago, but also recent algorithms that take advances in mass
spectrometry instrumentation into account, such as MS-GF+ [52] or Morpheus
[90]. According to the number of citations from 1994 to 2016 [87], the most
used database search algorithms are Mascot (4976), SEQUEST (3844), X!Tandem
(1228), and Andromeda (1009). In principle, scoring approaches can be divided
into two categories: (a) correlation scores between the theoretical and
experimental spectra, and (b) probabilistic approaches considering the
probability of random matches [87].
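
The basic loop of a database search, i.e., candidate selection by precursor
mass followed by scoring against the theoretical spectrum, can be sketched as
follows. The score used here is a plain count of matched theoretical fragment
peaks, which is a placeholder and not the scoring function of any search
engine mentioned above. Tolerances are given in Da for simplicity, and the
candidate list with its pre-computed fragment masses is hypothetical.

```python
def count_shared_peaks(observed_mzs, theoretical_mzs, tol=0.02):
    """Number of theoretical fragment peaks found in the observed spectrum."""
    return sum(any(abs(obs - theo) <= tol for obs in observed_mzs)
               for theo in theoretical_mzs)


def search(precursor_mass, observed_mzs, candidates,
           precursor_tol=0.01, fragment_tol=0.02):
    """Score all candidates within the precursor tolerance, best match first.

    `candidates` is a list of (peptide, mass, theoretical_fragment_mzs).
    """
    in_range = [c for c in candidates
                if abs(c[1] - precursor_mass) <= precursor_tol]
    scored = [(count_shared_peaks(observed_mzs, c[2], fragment_tol), c[0])
              for c in in_range]
    return sorted(scored, reverse=True)


# Hypothetical in-silico digest with pre-computed theoretical fragments.
candidates = [
    ("PEPA", 1000.000, [100.0, 200.0, 300.0]),
    ("PEPB", 1000.005, [100.0, 250.0, 350.0]),
    ("PEPC", 1200.000, [100.0, 200.0, 300.0]),
]
observed = [100.01, 200.0, 299.99, 500.0]
print(search(1000.0, observed, candidates))
```

PEPC is never scored, although its fragments would match, because its mass
lies outside the precursor tolerance.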


Correlation Scores 


SEQUEST is the most prominent representative of this category. Here, a
cross-correlation score (XCorr) is calculated between the experimentally
measured spectrum and the generated theoretical spectrum, in which all ions
have fixed intensities (see Figure 1.10). X!Tandem also uses a
cross-correlation score to determine the quality of a match in its hyperscore.

Figure 1.9: Illustration of peptide identification using database search, taken
from Verheggen et al. [87]. A search engine has to be able to read and filter
experimental spectra, to read the sequence database and generate theoretical
spectra based on the instrument type, to calculate peptide spectrum matches
(PSMs) for each spectrum and the corresponding potential candidates in the
database, and to output the best matches.


[Figure 1.10, panel labels: "MS/MS data preparation", "each segment normalized to 50", "Comparison (FFT)"]

Figure 1.10: Theoretical spectrum generation in SEQUEST taken from 
Kapp et al. [47]. All ions have a fixed intensity and are compared to the 
normalized experimental spectrum by calculating a cross-correlation score. 


Probabilistic Scores


In contrast to the deterministic cross-correlation scores, probability-based
scoring models estimate the probability that a given peptide spectrum match
originated from a random event. This approach was first used in 1999 in the
Mascot algorithm, although the actual algorithm has never been published and
has been kept secret ever since [76]. Several algorithms, such as OMSSA or
X!Tandem [13], follow a similar approach by also estimating the probability
that a peptide match is a random event.
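
The idea behind such scores can be illustrated with a generic binomial model.
This is an assumption made for the sake of the example and explicitly not the
unpublished Mascot algorithm or the model of any other engine named above:
each theoretical fragment is assumed to hit some observed peak by chance with
probability p, estimated from the fraction of the m/z range covered by the
tolerance windows, and the score is the negative logarithm of the probability
of reaching at least the observed number of matches by chance. All function
names and example values are hypothetical.

```python
import math


def chance_hit_probability(n_peaks, tolerance, mz_range):
    """Probability that a random m/z value falls within +/- tolerance of a peak."""
    return min(1.0, n_peaks * 2 * tolerance / mz_range)


def random_match_probability(n_theoretical, n_matched, p_single):
    """P(X >= n_matched) for X ~ Binomial(n_theoretical, p_single)."""
    return sum(math.comb(n_theoretical, i)
               * p_single ** i * (1 - p_single) ** (n_theoretical - i)
               for i in range(n_matched, n_theoretical + 1))


def score(n_theoretical, n_matched, p_single):
    """Higher score = match is less likely to be a random event."""
    return -10 * math.log10(
        random_match_probability(n_theoretical, n_matched, p_single))


# Same match (12 of 24 fragments found), two fragment mass tolerances in Da.
for tol in (0.8, 0.02):
    p = chance_hit_probability(n_peaks=100, tolerance=tol, mz_range=2000.0)
    print(f"tolerance {tol} Da: p = {p:.3f}, score = {score(24, 12, p):.1f}")
```

In this simple model, a narrower fragment tolerance lowers the chance of a
random hit and therefore raises the score of the same match.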


Recent Advancements 


Recent innovations in database search include the so-called "open search" or
"blind search", where all spectra are searched with a wide precursor mass
tolerance, e.g., 200 Da, allowing peptides carrying post-translational
modifications to be matched [70]. The development of MSFragger in 2017 enabled
the use of such an approach in a daily routine, as searches can be finished in
a reasonable amount of time due to sophisticated peptide indexing.


1.3 Post-Translational Modifications 


Regulatory processes in cells are often enabled or deactivated through
so-called post-translational modifications (PTMs) of certain proteins [49].
Here, specific molecules attach to proteins, often leading to a conformational
change of the protein that blocks or activates certain binding sites.
Currently, several hundred different PTMs are known and listed in various
databases [5]. A list of the most commonly observed PTMs in Swiss-Prot can be
found in Table 1.2.

In addition, modifications are often introduced on purpose during sample
preparation, e.g., to dissolve protein 3D structures, making as many cleavage
sites as possible available for enzymatic digestion [40]. Carbamidomethylation
of cysteine is one example of such an introduced modification. When trying to
identify peptides in mass spectra, these modifications have to be considered,
regardless of the approach used. This is not an easy task, as an increasing
number of considered modifications leads to an increasing number of candidate
peptides. Post-translational modifications are potential pitfalls of spectrum
identification and may lead to erroneous results [51]. Having identified a
certain PTM on a peptide does not always mean that its exact location can be
determined. This is where PTM localization tools come into play, such as
phosphoRS [83], MD-Score [78], PTM Score [67], or A-Score [6]. They are
normally applied after the peptide identification process to determine the
correct modification site within the identified peptide sequence.
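
The growth of the candidate list with each considered modification can be made
concrete with a small enumeration. The sketch assumes a single variable
modification that may or may not be present on each of its target residues
(phosphorylation on S, T, and Y is used as the example); the function name and
the optional limit on the number of modifications are illustrative.

```python
from itertools import combinations


def modification_site_sets(peptide, target_residues, max_mods=None):
    """All sets of positions (0-based) that could carry the modification."""
    sites = [i for i, aa in enumerate(peptide) if aa in target_residues]
    limit = len(sites) if max_mods is None else min(max_mods, len(sites))
    forms = []
    for k in range(limit + 1):
        forms.extend(combinations(sites, k))
    return forms


# Peptides from the tryptic digest in Figure 1.2, phosphorylation on S/T/Y.
for peptide in ("GR", "SLVDISLR", "DPAGINTYGQVYK"):
    forms = modification_site_sets(peptide, "STY")
    print(peptide, len(forms), "candidate forms")
```

With n possible sites, a single variable modification already yields 2^n
candidate forms per peptide, each of which needs its own theoretical spectrum.
The forms that share the same number of modifications also share the same
precursor mass, which is why a separate localization step is needed.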




Modification                   Frequency
phosphorylation                57191
acetylation                    6656
N-linked glycosylation         5343
amidation                      2830
hydroxylation                  1608
methylation                    1497
O-linked glycosylation         1104
ubiquitylation                 843
pyrrolidone carboxylic acid    810
sulfation                      490
gamma-carboxyglutamic acid     450
sumoylation                    393
palmitoylation                 271


Table 1.2: Most common experimentally observed PTMs in the Swiss-Prot 
database. Adapted from Khoury et al. [50]. 


1.4 Validation of Peptide Spectrum Matches 


Having identified a peptide in a mass spectrum does not necessarily mean that
the identification is correct. Apart from the possibility that the algorithm
does not work properly, there are several other reasons for false
identifications. In database or library search, for example, the correct
peptide may not be present in the database. Moreover, the underlying peptide
could be post-translationally modified without the modification having been
accounted for, or the spectrum could simply not contain enough informative
peaks. The identification algorithm will assign the "best matching candidate"
to the spectrum, where the score indicates the goodness of the fit. Still, the
question remains which score one can trust. It is therefore crucial to filter
the results down to those candidates that are very likely to be correct. In
mass spectrometry experiments, this is normally done by false discovery rate
(FDR) estimation, where the proportion of false identifications among a certain
set of matches is estimated [26]. To this end, the search is performed not only
against the database/library of known peptides (the so-called "target
database"), but also against a database containing only fictional peptides, the
so-called "decoy database". For database search, several different methods of
generating decoy peptides from the available target peptides have been proposed
[26], including:


1. random shuffle
2. reverse
3. pseudo reverse

Figure 1.11: Distribution of false targets and decoys, taken from Elias et al.
[26]. The x-axis represents the ranks for each match assigned by SEQUEST. Rank
1 is the best matching peptide, rank 10 the 10th best peptide. The y-axis shows
the percentage of these ranks belonging to either the target database (blue) or
the decoy database (red). It is equally likely to match a false target peptide
or a decoy peptide. Peptides at rank two and higher are the second, third, etc.
best hits and are normally false matches.


Pseudo reverse in the context of generating decoy databases means maintaining
the enzymatic digestion pattern, i.e., keeping the amino acid that is the
enzymatic cleavage target in place and reversing all other amino acids. To
fulfill the constraints of the target-decoy approach (TDA), the two databases
have to be of equal size. The approach assumes that false matches to the target
database are as likely as matches to the decoy database (see Figure 1.11).
Under that assumption, the number of false matches among the target hits can be
estimated by the number of matches to the decoy database (see Figure 1.12):


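\begin{equation}
\mathrm{FDR} = \frac{\#\text{decoyMatches}}{\#\text{targetMatches}}
\approx \frac{\#\text{falseMatches}}{\#\text{correctMatchesInTarget} + \#\text{falseMatchesInTarget}}
\tag{1.7}
\end{equation}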
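
Both the pseudo-reverse decoy generation and the FDR estimate are short enough
to be sketched directly. The sketch assumes an enzyme that cleaves after its
target residue (such as trypsin), so that the last amino acid of each peptide
is kept in place, and it represents peptide spectrum matches as hypothetical
(score, is_decoy) pairs.

```python
def pseudo_reverse(peptide):
    """Reverse all residues except the c-terminal cleavage residue."""
    return peptide[:-1][::-1] + peptide[-1]


def estimate_fdr(psms, score_threshold):
    """Decoy matches divided by target matches at or above the threshold.

    `psms` is a list of (score, is_decoy) tuples.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= score_threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= score_threshold and is_decoy)
    return decoys / targets if targets else 0.0


print(pseudo_reverse("SLVDISLR"))  # decoy keeps the terminal R in place

# Hypothetical search result against a combined target and decoy database.
psms = [(90, False), (80, False), (70, True), (60, False), (50, True)]
for threshold in (80, 60):
    print(threshold, round(estimate_fdr(psms, threshold), 3))
```

Raising the score threshold until the estimated FDR drops below a chosen
value is the usual way to apply this estimate as a filter.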


1.5 Chimeric Spectra 


Most database search algorithms assign the best matching peptide to the
corresponding tandem mass spectrum, following the one-spectrum-one-peptide
rule. However, due to overlapping retention times and similar m/z values,
multiple precursors can be co-fragmented and contribute fragment ions to a
single spectrum. On the one hand, the resulting spectra, so-called mixed or
chimeric spectra, complicate the identification process; on the other hand,
they also carry great potential. Additionally identified peptides in a tandem
mass spectrum can be used either to confirm a peptide already identified in
another spectrum or to reveal unidentified, potentially low-abundant peptides.
Several approaches capable of identifying chimeric spectra have already been
published [94] [79] [95], but they are often not used or not easily applicable
in a traditional proteomics workflow. Peptide spectrum matches originating from
co-eluting peptides have to be validated separately from PSMs of the original
precursor peptides, a step that is often difficult to perform without
bioinformatics skills.

Figure 1.12: FDR calculation for identified peptides [43]. Decoy and target
databases have to be of the same size. With proper decoy generation, the
number of matches to the decoy database can be used to estimate the number of
false matches to the target database.


1.6 Bioinformatics Challenges in Peptide Identification


Due to recent developments in mass spectrometry instrumentation, including
Higher-Energy Collisional Dissociation (HCD) [73], Electron Transfer
Dissociation (ETD) [80], electron-transfer and higher-energy collision
dissociation (EThcD) [81], and high-resolution mass spectrometers such as
Orbitraps, the need for efficient and accurate identification algorithms
arises. Current gold-standard algorithms such as Mascot [76] and SEQUEST [27],
which were developed more than a decade ago, might not be optimally suited for
the types of mass spectra available today.

Changing the tolerated mass error for fragment peaks of MS/MS spectra from
broad (0.8 Da, often used for spectra of low accuracy) to narrow (0.02 Da,
often used for spectra of high accuracy) does not have a significant effect on
the achieved scores (see Figure 1.13). If a score is considered to be a measure
of the correctness of the identification, such a change would, however, be
expected.

In addition, to be able to fully trust the results of a search engine, the 
functionality and the scoring function of the algorithm should be known. 
The most popular search algorithm, Mascot, has been published following the 
black box model: it provides an interface to put in spectral data and receive 
scored and validated peptide spectrum matches without stating how these 
results have been created. Therefore, there is a need for a white box 
algorithm, especially designed for the new generation of mass spectrometers 
providing data of high resolution and high accuracy, that gives users a 
readily available and comprehensible way to accurately identify the 
peptides and proteins in their measured data sets. 


[Plot "Score Comparison": Mascot scores at MS2 tolerance = 0.02 Da plotted 
against Mascot scores at MS2 tolerance = 0.8 Da] 

Figure 1.13: Comparison of Mascot scores on a human cancer cell data set 
measured on a Thermo Fisher Q Exactive at a 1h gradient. The data set has 
been searched using Mascot at two different fragment ion mass tolerances, 
i.e., 0.8 Da and 0.02 Da. Mascot scores of the same peptide spectrum match 
do not differ substantially between the two strategies, although this would 
be expected if the score is a measure of the correctness of the identification. 
Lower tolerances imply higher mass accuracies, making it more difficult to 
randomly match a fragment ion, increasing the certainty of a match. This 
is however not reflected in the score. 


Chapter 2 


Contributions of the Author 


The work presented in this thesis addresses the previously mentioned 
issues in the identification of peptides in tandem mass spectrometry data. 
All results have been disseminated in conference talks, poster presentations, 
and the following journal publications: 


• Viktoria Dorfer, Peter Pichler, Thomas Stranzl, Johannes Stadlmann, 
Thomas Taus, Stephan Winkler, and Karl Mechtler (2014). 
MS Amanda, a universal identification algorithm optimized for high 
accuracy tandem mass spectra. Journal of Proteome Research, 13(8), 
3679-3684. https://doi.org/10.1021/pr500202e 


• Viktoria Dorfer, Sergey Maltsev, Stephan Winkler, and Karl Mechtler 
(2018). CharmeRT: Boosting peptide identifications by chimeric spectra 
identification and retention time prediction. Journal of Proteome 
Research, 17(8), 2581-2589. 
https://doi.org/10.1021/acs.jproteome.7b00836 


• Marina Strobl, Sergey Maltsev, Stephan Winkler, Karl Mechtler, and 
Viktoria Dorfer. MS Amanda 2.0: Recent advancements and updates 
for the MS Amanda search engine. Manuscript in preparation. 


• Viktoria Dorfer, Sergey Maltsev, Stephan Dreiseitl, Karl Mechtler, 
and Stephan Winkler (2015). A Symbolic Regression Based Scoring 
System Improving Peptide Identifications for MS Amanda. Proceedings 
of the Companion Publication of the 2015 Annual Conference on 
Genetic and Evolutionary Computation (pp. 1335-1341). New York, 
NY, USA: ACM Press. https://doi.org/10.1145/2739482.2768509 


• Eric W. Deutsch, Yasset Perez-Riverol, Robert J. Chalkley, Mathias 
Wilhelm, Stephen Tate, Timo Sachsenberg, Mathias Walzer, Lukas Käll, 
Bernard Delanghe, Sebastian Böcker, Emma L. Schymanski, 
Paul Wilmes, Viktoria Dorfer, Bernhard Kuster, Pieter-Jan Volders, 
Nico Jehmlich, Johannes P. C. Vissers, Dennis W. Wolan, Ana Y. 
Wang, Luis Mendoza, Jim Shofstahl, Andrew W. Dowsey, Johannes 
Griss, Reza M. Salek, Steffen Neumann, Pierre-Alain Binz, Henry 
Lam, Juan Antonio Vizcaíno, Nuno Bandeira, and Hannes Röst (2018). 
Expanding the Use of Spectral Libraries in Proteomics. Journal of 
Proteome Research, 17(12), 4051-4060. 
https://doi.org/10.1021/acs.jproteome.8b00485 


• Sander Willems, David Bouyssié, Matthieu David, Marie Locard-Paulet, 
Karl Mechtler, Veit Schwämmle, Julian Uszkoreit, Marc Vaudel, and 
Viktoria Dorfer (2017). Proceedings of the EuBIC Winter School 2017. 
Journal of Proteomics, 161, 78-80. 
https://doi.org/10.1016/j.jprot.2017.04.001 


• Sander Willems, David Bouyssié, Dieter Deforce, Viktoria Dorfer, 
Vladimir Gorshkov, Dominik Kopczynski, Kris Laukens, Marie 
Locard-Paulet, Veit Schwämmle, Julian Uszkoreit, Dirk Valkenborg, Marc 
Vaudel, and Wout Bittremieux (2018). Proceedings of the EuBIC 
Developer's Meeting 2018. Journal of Proteomics, 187, 25-27. 
https://doi.org/10.1016/j.jprot.2018.05.015 


The work of this thesis has also been relevant for the publication "PhoStar: 
Identifying Tandem Mass Spectra of Phosphorylated Peptides before Database 
Search" (Dorl, Winkler, Mechtler & Dorfer) and has been presented 
at numerous conferences, user meetings, and workshops (such as ASMS, 
ISMB, EuPA, Proteome Discoverer User Meeting, de.NBI Summer School, 
MedGEC, or APRS). 


2.1 Peptide Identification 


As discussed in Section 1.6, the need for a white box algorithm able to 
exploit the potential of the new generation of highly accurate tandem mass 
spectrometers was apparent. The paper "MS Amanda, a universal identification 
algorithm optimized for high accuracy tandem mass spectra", published 
in the Journal of Proteome Research, 2014, is included in this thesis, see 
Chapter 3. This publication describes the MS Amanda algorithm, a novel 
approach for peptide identification especially designed for high-accuracy 
tandem mass spectra. 

The scoring algorithm, the core of each peptide identification algorithm, 
consists of four major parts, see Figure 2.1: 


• Peak picking depth determination 
• Random match probability calculation 
• Consideration of explained intensity 
• Score readability enhancement 


[Figure 2.1 legend: s = spectrum, pep = peptide, m = peak picking depth, 
P = probability of random match, eif = explained ion flow] 


Figure 2.1: Scoring algorithm of the MS Amanda search engine [24]. Scores 
are calculated by combining the probability to match a certain number 
of peaks by chance, the explained intensity, and the optimal peak picking 
depth. Calculated numbers are then log-transformed for higher readability. 


Comparisons to other search engines have shown that MS Amanda improves 
upon the well-known search tools, as it confidently explains a higher 
number of spectra at the same false discovery rate, see Figure 2.2. In 
addition, a high overlap of identified peptides with Mascot [76] and 
SEQUEST was achieved, see Figure 2.3. 

It is apparent that the proteomics community was eager and ready for 
the development of novel approaches for peptide identification in mass 
spectrometry data. As of February 2019, according to Google Scholar, the 
paper describing the MS Amanda algorithm has been cited 144 times since its 
publication in 2014 [36]. The MS Amanda software package has been 
downloaded more than 6000 times. 


Figure 2.2: Comparison of the MS Amanda algorithm to standard database 
search engines Mascot and SEQUEST [24]. The underlying data set is a 
human cancer cell line measured and published by Michalski et al. [64]. 


Figure 2.3: Overlap of peptides of a single replicate of Figure [2.2] identified 
by MS Amanda, Mascot and SEQUEST [24]. In addition to the high overlap, 
MS Amanda identifies a substantial number of peptides unidentified by the 
other search engines. 


2.2 Chimeric Spectra Identification 


Chapter 4 describes a novel framework for the identification of chimeric 
spectra, i.e., multiple peptides in a spectrum, which has been published as 
"CharmeRT: Boosting Peptide Identifications by Chimeric Spectra 
Identification and Retention Time Prediction" in the Journal of Proteome 
Research, 2018. Several different approaches have been tested and presented 
at various conferences [22], leading to an improved strategy for chimeric 
peptide identification [23]. The second part of the CharmeRT paper, the 
validation of chimeric spectra identifications, is not part of this thesis and 
is mainly the work of a co-author of this paper, Sergey Maltsev. Chimeric 
spectra are identified in a two-step process consisting of two consecutive 
searches. The following steps are conducted to identify multiple peptides: 


1. Original precursor peptide identification 
2. Identified peaks removal 

3. Co-eluting precursor candidate selection 
4. Co-eluting precursor peptide identification 
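
The four steps can be pictured as a small control flow. The following Python sketch is only an illustration under stated assumptions and is not the CharmeRT implementation: the `search` callback, the function names, and the way candidate precursors are passed in are all hypothetical, and peak removal is reduced to a simple m/z tolerance check.

```python
def remove_matched_peaks(peaks, matched_mz, tol):
    """Step 2 (assumed form): drop peaks already explained by the first hit.

    peaks: list of (mz, intensity); matched_mz: m/z values of matched fragments.
    """
    return [(mz, inten) for mz, inten in peaks
            if all(abs(mz - m) > tol for m in matched_mz)]


def identify_chimeric(peaks, search, candidate_precursors, tol):
    """Run two consecutive searches on one spectrum.

    `search(peaks, precursor)` is a hypothetical callback that returns
    (peptide, matched_mz) or None; precursor=None means the original precursor.
    """
    first = search(peaks, None)                            # step 1
    if first is None:
        return []
    peptide, matched_mz = first
    residual = remove_matched_peaks(peaks, matched_mz, tol)  # step 2
    identified = [peptide]
    for precursor in candidate_precursors:                 # step 3 (given here)
        second = search(residual, precursor)               # step 4
        if second is not None:
            identified.append(second[0])
    return identified
```

In this sketch the candidate selection of step 3 is assumed to have happened upstream; how candidates are actually chosen is described in Chapter 4.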


We found that even in samples with instrument settings designed to 
avoid co-eluting peptides (1h gradient and 2m/z isolation width), more than 
30% of all spectra carry a second peptide, increasing to more than 60% for 
very complex samples, as shown in Figure 2.4. Figure 2.5 shows the benefits 
of chimeric spectra identification: without investing any further 
instrument acquisition time, a high number of additional peptides 
can be identified even in the simpler biological samples just by applying 
chimeric spectra search. 


2.3 Related Work 


Chapter 5 describes related work by the author of this thesis. Section 5.1 
includes a paper in which different strategies for validating peptide 
identifications have been tested [21]. A common method to extend the number 
of confidently identified peptides is the use of Percolator [46], a support 
vector machine trained to separate target from decoy peptides based on 
the false discovery rate assumptions (see Section 1.4). The author tested 
similar methods to perform this step, namely random forests [8] and genetic 
programming [54], indicating a general benefit of using machine learning 
methods for peptide spectrum match validation. 


Figure 2.4: Occurrence of chimeric spectra for human cancer cell samples 
measured with different isolation widths (2-8 m/z) and different gradient 
times (1h/3h) [23]. Already 30% of all confidently explained spectra carry 
at least a second peptide at instrument settings designed to avoid co-eluting 
precursors (1h gradient, 2m/z isolation width), rising to more than 60% for 
very complex samples. 


In addition, constant work has been performed to extend and maintain 
the MS Amanda algorithm, such as, but not limited to: 


• Implementation of new ion types for UVPD spectra [9] 
• Performance optimizations 
• Support for negative mode 
• Extended PTM support 
• Support for standard input and output formats 


Results of this work have already been disseminated at various conferences 
(ASMS, APRS, EuBIC Winter School, Pro-MET Meeting), and a manuscript 
summarizing these extensions is in preparation and will be published in 
the upcoming months (Strobl M., Maltsev S., Winkler S., Mechtler K., 
Dorfer V., "MS Amanda 2.0: Recent advancements and updates for the MS 
Amanda search engine." Manuscript in preparation). 

Spectral library search has gained growing interest in recent years, 
accompanied by new challenges. In this context, a first 
community paper has been drafted with the contribution of the author of 
this thesis and has been published in the Journal of Proteome Research, 
2018. Section 5.2 contains this publication. 


Figure 2.5: Comparison of identifications with and without chimeric spectra 
search enabled for human cancer data sets measured at various isolation 
widths and gradient times [23]. The highest benefit can be achieved using 
4m/z isolation width and a gradient time of 3h, enabling chimeric spectra 
generation. Up to 45% more unique peptides at 1% FDR can be identified 
using the chimeric spectrum identification approach. 


The author has also actively participated in building up a bioinformatics 
proteomics community in Europe, called EuBIC (European Bioinformatics 
Community, https://www.proteomics-academy.org/), as part of 
the European Proteomics Association (EuPA). In this context, several 
bioinformatics hubs and workshops have been organized at various conferences 
(such as ASMS, HUPO, or EuPA) and a yearly series of EuBIC conferences 
has started, alternating between the EuBIC Winter School and the EuBIC 
Developer's Meeting. This is one of the main events in the bioinformatics 
proteomics community in Europe, where renowned researchers present and 
discuss current issues and challenges in this field. Results and 
summaries of these events have been published in the 
Journal of Proteomics. Chapter 5 also contains the publications from 
the preceding events in 2017 and 2018. 


Chapter 3 


MS Amanda, a universal 
identification algorithm 
optimized for high accuracy 
tandem mass spectra 


This chapter contains the publication of the original MS Amanda algorithm, 
published in the Journal of Proteome Research, 2014. The algorithm is 
compared to state-of-the-art algorithms and shows increased performance by 
identifying a substantial number of additional PSMs and peptides at the 
same FDR [24]. 

Reprinted with permission from Dorfer, V.; Pichler, P.; Stranzl, T.; 
Stadlmann, J.; Taus, T.; Winkler, S.; Mechtler, K. MS Amanda, a Uni- 
versal Identification Algorithm Optimized for High Accuracy Tandem Mass 
Spectra. J. Proteome Res. 2014, 13 (8), 3679-3684. 
http://pubs.acs.org/articlesonrequest/AOR-6DyVQ3j4YTcGXyaskJvi. 
Copyright 2014 American Chemical Society. 


ABSTRACT: Today's highly accurate spectra provided by modern tandem mass 
spectrometers offer considerable advantages for the analysis of proteomic samples of 
increased complexity. Among other factors, the quantity of reliably identified peptides is 
considerably influenced by the peptide identification algorithm. While most widely used 
search engines were developed when high-resolution mass spectrometry data were not 
readily available for fragment ion masses, we have designed a scoring algorithm particularly 
suitable for high mass accuracy. Our algorithm, MS Amanda, is generally applicable to 
HCD, ETD, and CID fragmentation type data. The algorithm confidently explains more 
spectra at the same false discovery rate than Mascot or SEQUEST on examined high mass 
accuracy data sets, with excellent overlap and identical peptide sequence identification for 
most spectra also explained by Mascot or SEQUEST. MS Amanda, available at 
http://ms.imp.ac.at/?goto=msamanda, is provided free of charge both as standalone version for 
integration into custom workflows and as a plugin for the Proteome Discoverer platform. 


INTRODUCTION 


Mass spectrometry (MS)-based proteomics has evolved into an 
indispensable approach in biological sample analysis. In 
shotgun proteomics experiments, proteins are proteolytically 
cleaved to peptides, separated based on specific physicochemical 
properties, and subsequently analyzed in a mass 
spectrometer. 

Obtained spectra, containing mass-to-charge ratios of either 
charged peptides (MS1) or fragment ions (MS/MS or MS2) 
associated with respective ion intensities, are matched to 
candidate peptides, and a score dependent on an identification 
algorithm is assigned to each peptide spectrum match (PSM). 

Scoring algorithms such as Mascot, SEQUEST, X!Tandem, 
Andromeda, OMSSA, MyriMatch, Phenyx, or 
Morpheus incorporate various strategies to evaluate the 
quality of a PSM. In particular, SEQUEST reports a 
cross-correlation score of the acquired mass spectrum matching a 
modeled peptide spectrum. In comparison, Mascot estimates 
the probability that a particular peptide spectrum match is a 
random event by probabilistic modeling. Other search engines 
are specifically designed for a particular purpose such as the 
analysis of post-translationally modified peptides (e.g., 
ModifiComb or InsPecT). 


Recent technological advances in instrumentation allow 
high-throughput identification of thousands of proteins, which is 
a prerequisite for the challenging analysis of complete 
proteomes. Tackling the complete yeast proteome, the Mann 
group was able to detect more than 2000 proteins in 48 h in 
2006. Only a few years later, both the Mann group in 2012 and 
Coon and co-workers in 2013 described comprehensive 
analyses of the nearly complete yeast proteome at manifoldly 
decreased runtimes. The continuous increase in throughput 
and precision enables the research community to address 
previously unsolvable scientific challenges, such as the in-depth 
analysis of mammalian proteomes. Recent studies identified 
more than 10 000 human proteins in the proteome of a human 
cancer cell line, which is suggested to be close to 
completion. 

Technological development of instruments leads to more 
reliable data subsequently used by MS search engines for the 
assignment of potential peptides to spectra. While newer 
instruments deliver potentially more MS/MS spectra per time 
unit, typically only up to 60% of these spectra are confidently 
assigned to peptides, suggesting a potential for improvement. 
We further consider the emergence of high-resolution 
instruments with highly accurate mass recordings 
as a stimulus for the development of peptide search 
algorithms particularly suitable to such data. 

We here describe MS Amanda, a novel database search 
engine, specially developed for high-resolution tandem mass 
spectrometry data, taking advantage of high mass accuracy and 
considering fragment ion intensities. To show the general 
applicability of MS Amanda, the performance of the algorithm 
was evaluated on HCD, ETD, and CID fragmentation type 
data. 


MATERIALS AND METHODS 


MS Amanda Identification Algorithm 


We have designed MS Amanda based on a binomial 
distribution function incorporating peak intensities and 
determining favorable outcomes (successes) and possible 
outcomes (sample space) in a specific manner. Our multithreaded 
implementation in C# incorporates the described 
identification algorithm. 

During preprocessing, peaks corresponding to precursor ions 
are removed and an optional de-isotoping of fragment ions is 
applied (intensities of discarded isotopes are added to C12 
peaks). In order to discriminate ion signals from noise, peak 
picking is performed. In each 100 Da window, the m most 
intense peaks are picked, where m is a value between 1 and 10. 
All possible values for m are tested, and the value representing 
the maximum PSM score is selected. 
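
As an illustration of the peak picking step, the following Python sketch keeps the m most intense peaks per 100 Da window. It is a minimal sketch and not the C# implementation: the function name is hypothetical, and aligning the windows at multiples of 100 Da starting from 0 is an assumption, since the text does not state where windows begin.

```python
from collections import defaultdict


def pick_peaks(peaks, m, window=100.0):
    """Keep the m most intense peaks in each `window`-Da bin.

    peaks: iterable of (mz, intensity) tuples.
    Returns the picked peaks sorted by m/z.
    Assumption: bins are aligned at multiples of `window`, starting at 0.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))
    picked = []
    for bin_peaks in bins.values():
        # Most intense peaks first, then truncate to the picking depth m.
        bin_peaks.sort(key=lambda peak: peak[1], reverse=True)
        picked.extend(bin_peaks[:m])
    return sorted(picked)
```

Testing all picking depths then amounts to calling `pick_peaks(peaks, m)` for m = 1, ..., 10 and keeping the depth that yields the highest score.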

Theoretical fragment ions of each candidate peptide, that is, of 
all peptides in the (forward or decoy) database that match the 
precursor mass of a certain spectrum within a specific MS1 
mass tolerance, are matched to E, the set of picked peaks, 
allowing a given MS2 mass tolerance (t). The first part of the 
scoring algorithm used in MS Amanda is based on a cumulative 
binomial distribution function defined as 

$$P(n, p, N) = \sum_{k=n}^{N} \binom{N}{k} p^{k} (1 - p)^{N-k} \tag{1}$$

that is, the probability to match at least n out of N peaks by 
chance. This formula assumes that the random variable 
denoting the number of matched peaks follows a binomial 
distribution as the sum of Bernoulli random variables $X_i$ 
($i = 1, \dots, N$). For each $X_i$, p is the probability to match one 
peak by chance (see formula 3). In our usage of the cumulative 
binomial distribution function, n is the number of matched 
peaks, and N is the number of picked peaks. We assume 
independence of the $X_i$. 
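
Formula 1 can be evaluated directly as a finite sum. The Python sketch below is illustrative only (the function name is hypothetical) and uses just the standard library.

```python
from math import comb


def random_match_probability(n, N, p):
    """P(n, p, N): probability of matching at least n of N picked peaks
    by chance, where p is the probability of matching one peak by chance."""
    return sum(comb(N, k) * p**k * (1.0 - p) ** (N - k)
               for k in range(n, N + 1))
```

For example, with N = 2 picked peaks and p = 0.5, matching at least one peak by chance has probability 0.75, and matching both has probability 0.25.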

The probability p to match one peak by chance is the ratio 
of the m/z range that is covered by the theoretical ions f(pep) 
to the total mass window (first peak to last peak in the 
experimental spectrum) considering peak picking depth m. The 
covered m/z range of f(pep) is based on fragment ion tolerance 
t, considering solely fragment masses in the mass range of the 
first peak ($e_1(s,m)$) and the last peak ($e_N(s,m)$) of spectrum s. 
Given the set F, which contains all theoretical fragment ions f(pep) 
within the mass of the first and the last picked peak of the 
experimental spectrum considering the fragment ion tolerance t, 

$$F(s, pep, m) = \{ f(pep) \mid (e_1(s,m) - t) \le f(pep) \le (e_N(s,m) + t) \} \tag{2}$$

probability p is defined as 

$$p(s, pep, m) = \frac{(|F(s, pep, m)| \times 2t) - O(F(s, pep, m))}{(e_N(s,m) + t) - (e_1(s,m) - t)} \tag{3}$$

The overlap O(F(s,pep,m)) is the sum of all overlapping 
ranges in the theoretical spectrum F considering mass tolerance 
t. With peaks $f_i$ sorted by m/z in ascending order, this overlap 
between consecutive peaks $f_i$ and $f_{i+1}$ is calculated as 

$$o(f_i, f_{i+1}) = \begin{cases} 0 & \text{if } f_{i+1} - f_i > 2t \\ 2t - (f_{i+1} - f_i) & \text{else} \end{cases} \tag{4}$$

$$O(F) = \sum_{i=1}^{|F|-1} o(f_i, f_{i+1}) \tag{5}$$

where $o(f_i, f_{i+1})$ is the overlap between two consecutive 
fragment ions $f_i$ and $f_{i+1}$. For a graphical illustration see 
Supporting Information Figure S1. 
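
A compact way to read formulas 2 to 5 is as a computation of the covered fraction of the m/z range. The Python sketch below is a minimal illustration with hypothetical function names. It assumes that the overlap of two consecutive tolerance windows equals 2t minus their m/z distance, which follows geometrically from windows of half-width t.

```python
def overlap(fragments, t):
    """O(F): summed overlap of the +/- t windows of consecutive fragment ions."""
    f = sorted(fragments)
    total = 0.0
    for a, b in zip(f, f[1:]):
        gap = b - a
        if gap <= 2 * t:           # windows [a-t, a+t] and [b-t, b+t] intersect
            total += 2 * t - gap
    return total


def single_peak_probability(theoretical, first_peak, last_peak, t):
    """p(s, pep, m): fraction of the m/z range covered by theoretical ions.

    first_peak and last_peak are the m/z values of the first and last
    picked peak of the experimental spectrum.
    """
    F = [f for f in theoretical if first_peak - t <= f <= last_peak + t]
    covered = len(F) * 2 * t - overlap(F, t)
    return covered / ((last_peak + t) - (first_peak - t))
```

As long as the tolerance windows do not overlap, p scales linearly with t, so narrowing the fragment tolerance from 0.8 Da to 0.02 Da reduces p by a factor of 40 for the same theoretical ions and nearly the same mass window.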

P(n,p,N) indicates the reliability of a peptide spectrum match 
under the null hypothesis of a random match based on a 
binomial distribution. As a consequence, more reliable PSMs 
are characterized by a low probability (for randomly matching 
peaks). To improve the distinction between false and correct 
identifications, we additionally consider the intensities of the 
peaks: the calculated probability to match at least n out of N 
peaks by chance is weighted by the reciprocal of the explained 
ion current eif(s,pep,m). 

$$eif(s, pep, m) = \frac{\sum_{x \in M} I(x)}{\sum_{x \in E} I(x)} \tag{6}$$

eif(s,pep,m) is the ratio of the sum of the intensities I(M) of 
the matched peaks M (|M| = n) to the sum of the intensities 
I(E) of all picked peaks E (|E| = N). The weighting rewards 
peptides matching more intense peaks over those matching less 
intense peaks. 
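
The intensity weighting and the log transformation of formulas 6 and 7 can be sketched in a few lines of Python. This is an illustration with hypothetical function names; the base-10 logarithm is an assumption, as the text only states that the value is log-transformed.

```python
from math import log10


def explained_ion_current(matched_intensities, picked_intensities):
    """eif: summed intensity of matched peaks over that of all picked peaks."""
    return sum(matched_intensities) / sum(picked_intensities)


def amanda_like_score(P, eif):
    """Score for one picking depth m: -10 * log10(P / eif).

    P is the random match probability of formula 1; base 10 is assumed.
    """
    return -10.0 * log10(P / eif)
```

A lower random match probability and a higher explained ion current both raise the score. Since every picking depth m is tested, the reported score would be the maximum of this value over m = 1, ..., 10.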

Finally, the quality of the match of peptide pep with spectrum 
s is represented by the MS Amanda score S(s,pep): 

$$S(s, pep) = -10 \times \log\left(\frac{P(s, pep, m)}{eif(s, pep, m)}\right) \tag{7}$$

The score S(s,pep) is the basis for further false discovery rate (FDR) 
estimation. 


Data Sets 


We compared the performance of MS Amanda based on four 
data sets: an HCD HeLa sample, a synthetic peptide library, a 
histone data set, and a CID HeLa sample. The HCD HeLa 
sample, published by Michalski et al., consists of three 
replicate measurements of tryptic peptides derived from one 
human cancer cell line. The synthetic peptide library, as 
described by Marx et al., is composed of more than 200 000 
phosphorylated and nonphosphorylated peptides. Performance 
comparisons were based on provided HCD and ETD data. The 
histone data set is composed of four different preparations, 
namely, Histone II-A from calf thymus (Sigma), Histone III-S 
from calf thymus (Sigma), Histone IV from Xenopus laevis, 
recombinantly expressed in Escherichia coli (Upstate), and Core 
Histones from chicken erythrocytes (Millipore). The published 
CID HeLa sample covers three replicates measured with a 1 h 
gradient (1 µg). 


Histone Sample Preparation 


Samples were reduced and alkylated using dithiothreitol 
(DTT; 2 mM, final concentration) and methyl methanethiosulfonate 
(MMTS; 5 mM, final concentration). Proteins were 
digested overnight with endoproteinase Glu-C (from 
Staphylococcus aureus V8, Sigma) in 100 mM ammonium bicarbonate 
at 37 °C. 

Peptides were separated on a reversed-phase column 
(Acclaim PepMap RSLC column, 2 µm, 100 Å, 75 µm × 500 
mm, Thermo Fisher) by a linear gradient from 0.8 to 32% 
acetonitrile in 0.1% formic acid over 30 min on an RSLC nano 
HPLC system (Dionex). The eluting peptides were directly 
analyzed using a hybrid quadrupole-orbitrap mass spectrometer 
(QExactive, Thermo Fisher). The QExactive mass spectrometer 
was operated in data-dependent mode, using a full scan 
(m/z range 350-2000, nominal resolution 140 000, target 
value 1 × 10^6) followed by MS/MS scans of the 12 most 
abundant ions. MS/MS spectra were acquired at a resolution of 
17 500 using normalized collision energy 30%, isolation width 
of 2, and the target value was set to 5 × 10^4. Precursor ions 
selected for fragmentation (charge state 3 and higher) were put 
on a dynamic exclusion list for 10 s (dynamic exclusion 
tolerance is 10 ppm on QExactive by default). Additionally, the 
underfill ratio was set to 20%, resulting in an intensity threshold 
of 2 × 10^4. The peptide match feature and the exclude isotopes 
feature were enabled. 


Database Search Settings 


Proteome Discoverer version 1.4.288 (PD) was used for 
peptide identifications. All data sets were searched with Mascot 
(version 2.2.1), SEQUEST (with probability score calculation) 
as provided in PD, and MS Amanda. Advanced search settings 
in PD were changed from default in order to store all PSMs in 
the result file (all cutoff filters and thresholds were disabled). 

Searches for the HeLa and the histone data sets were 
performed with 7 ppm precursor mass tolerance and 0.03 Da 
fragment ion mass tolerance (0.5 Da for CID). Following Marx et 
al., we used 5 ppm precursor mass tolerance and 0.02 Da 
fragment mass tolerance for the synthetic peptide library. For 
HCD and CID, considered fragment ions were left at defaults 
for Mascot and SEQUEST, and set to b and y ions for MS 
Amanda. ETD searches with Mascot and MS Amanda were 
performed using c, y, z + 1, and z + 2 ions. 

For the HeLa data sets, oxidation(M) was set as variable 
modification, carbamidomethyl(C) as fixed modification, and 
trypsin as enzyme allowing up to two missed cleavages. The 
peptide library was searched with oxidation(M) and 
phosphorylation(S,T,Y) as variable modifications and up to 
four missed cleavage sites for trypsin. 

Variable modification settings for the histone data set were 
oxidation(M), phosphorylation(S,T,Y), methyl(K,R), 
dimethyl(K,R), trimethyl(K), and acetyl(K). Methylthio(C) was set as 
fixed modification, GluC (C-terminal cleavage after D or E) as 
enzyme, and two as the maximum number of missed cleavages. 

Performance comparisons were based on 1% FDR. We 
generated concatenated forward and reverse (decoy) protein 
databases with contaminants using MaxQuant Sequence 
Reverser (v1.0.13.13). We searched the HeLa data sets 
against Swiss-Prot_human (release 2013_10), merged the 
synthetic peptide sequences with Swiss-Prot_human for the 
peptide library, and searched the histone data against the 
complete Swiss-Prot (release 2013_10). For FDR calculation, 
peptides shorter than 7 amino acids were discarded and 
conservative FDR estimation was ensured by preferring the 
decoy peptide to an equally scored peptide. Peptide grouping 
for unique peptide level FDR estimation was solely based on 
the peptide sequence, and the highest score was kept for each 
peptide group. 
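
The filtering rules above can be sketched as a small target-decoy routine. The Python code below is a hedged illustration, not the procedure used for the reported numbers: the function name and tuple layout are hypothetical, and estimating FDR as the number of decoy hits divided by the number of target hits above a score cutoff is a common convention assumed here.

```python
def psms_at_fdr(psms, max_fdr=0.01, min_length=7):
    """Return target PSMs accepted at the given FDR.

    psms: iterable of (score, peptide_sequence, is_decoy) tuples.
    Assumption: FDR is estimated as decoys / targets above the cutoff.
    """
    kept = [psm for psm in psms if len(psm[1]) >= min_length]
    # Descending score; on ties the decoy is ranked first (conservative).
    kept.sort(key=lambda psm: (-psm[0], not psm[2]))
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs that may be accepted
    for rank, (_, _, is_decoy) in enumerate(kept, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [psm for psm in kept[:cutoff] if not psm[2]]
```

For unique peptide level FDR, the same routine could be applied after keeping only the highest-scoring PSM per peptide sequence.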


RESULTS 


We compared PSM and peptide identifications of MS Amanda 
to Mascot and SEQUEST, two search algorithms widely used 
for peptide identification in mass spectrometry. Performance of 
MS Amanda was evaluated on an HCD HeLa set (Figure 1), on 
a synthetic peptide library (Figure 2), a histone data set (Figure 
3), and on a CID HeLa set. In addition to PSM identifications 
based on a forward decoy database approach at 1% FDR, we 
show results for unique peptides at 1% FDR in Supporting 
Information Table S1. 


Figure 1. Performance comparison on HCD HeLa data set. The 
previously published data set is composed of three replicates measured 
on a Thermo Fisher QExactive instrument. For all three replicates, 
consistently more PSMs were identified at 1% FDR (PSM level) with 
MS Amanda as compared to Mascot or SEQUEST. 

Performance of MS Amanda 


For HCD data, the numbers of identified PSMs by Mascot, 
SEQUEST, and MS Amanda are depicted in Figure 1 for the 
HeLa data set and Figure 2(A,B) for the synthetic peptide 
library. Results for the histone data set are shown in Figure 3. 
We report identified PSMs in the synthetic peptide library 
separately for nonphosphorylated (Figure 2A) and phosphory- 
lated (Figure 2B) peptides. 

Consistently higher quantities of PSM identifications were 
observed for MS Amanda as compared to both Mascot and 
SEQUEST for all high-resolution data sets. In the three HCD 
HeLa replicates, we identified between 11 and 22% more PSMs 
with MS Amanda compared to Mascot and SEQUEST. 

While SEQUEST performed slightly better than Mascot on 
the nonphosphorylated peptide library subset (2A), the 
reciprocal situation was observed on the phosphorylated 
peptide library subset (2B). Still, MS Amanda outperformed 
both search engines for both subsets by 4-22%. 


Figure 2. Identified PSMs in a synthetic peptide library comprising HCD and ETD data. Applying MS Amanda led to the highest number of 
identified PSMs on the HCD data set for both nonphosphorylated (A) and phosphorylated (B) peptides. A similar performance increase was 
observed on the ETD data set for nonphosphorylated (C) and phosphorylated (D) peptides. 


Figure 3. Performance comparison of identified PSMs in a histone 
data set. We used four different histone preparations originating from 
three species and measured them on a Thermo Fisher QExactive mass 
spectrometer. HCD raw files were combined for peptide identification. 
At 1% FDR, we identified more PSMs with MS Amanda than with 
Mascot and SEQUEST. 


For the histone data set, we identified 620 target PSMs with 
Mascot and 778 with SEQUEST. By applying MS Amanda we 
identified 969 PSMs, which corresponds to a performance 
increase in identified PSMs of 56 and 25%, respectively. 

We further compared the performance of MS Amanda with that of 
Mascot on the peptide library ETD data subset. Both search 
algorithms identified considerably more PSMs than SEQUEST; 
a comparison with SEQUEST on the ETD subset was therefore 
omitted. 

In accordance with our analysis of the HCD data subset, we 
report both PSMs of nonphosphorylated (Figure 2C) and 
phosphorylated (Figure 2D) peptides. While we identified 
13 489 PSMs of nonphosphorylated peptides with Mascot in the 
ETD data, we found notably more PSMs (16 400) with MS 
Amanda, which is a 22% increase in identified PSMs at 1% 
FDR. For the phosphorylated subset, we found a comparable 
trend. Here, we identified 12 016 PSMs with Mascot and 
12 979 PSMs with MS Amanda (an increase of 8%). 

Benchmarking MS Amanda, Mascot, and SEQUEST on the 
low-resolution CID data reported comparable performance for 
all three search engines, with slightly higher PSM identification 
rates for MS Amanda (1-5%; see Supporting Information 
Table S2). 

We list the numbers of identified PSMs for all three 
high-resolution data sets in Supporting Information Table S2. In 
Supporting Information Table S1, we show identified unique 
peptides at 1% FDR (peptide level) for the HCD and CID 
HeLa data set and for the HCD and ETD peptide library data 
sets. The limited number of proteins in the histone data set did 
not allow for accurate peptide level FDR estimation. On these 
data, we only report PSM level FDR estimation. 

For completeness, we also compared the performance of MS 
Amanda with the noncommercial search engine Morpheus, a 
recently described search algorithm which was also specifically 
designed for high mass accuracy MS² spectra (see Supporting 
Information Table S3). 


PSM Overlap 


Figure 4. Overlap of target PSMs based on one HCD HeLa replicate. 
MS Amanda explains large fractions of PSMs also identified by Mascot 
and SEQUEST. Further, our algorithm explains many peptides 
otherwise uniquely identified by either Mascot or SEQUEST. 


To show the validity of our approach, we investigated the 
overlap in target PSM identification for all three search 
algorithms. Analyzing one replicate of the HCD HeLa data set 
(MS Amanda 15 091 PSMs, Mascot 12 386 PSMs, SEQUEST 
12 858 PSMs), 9921 spectra were commonly identified by all 
three search engines (Figure 4). While MS Amanda identified 
considerably more unique PSMs than the compared search engines, 
its capability to identify large fractions of the 
peptides found by either Mascot or SEQUEST is noteworthy: 
92% of the PSMs identified by Mascot and likewise 92% of those 
identified by SEQUEST are reliably found by MS Amanda, 
while only 80% of PSMs identified by SEQUEST and 83% of 
PSMs identified by Mascot are also found by the respective 
other search engine. This highlights that MS Amanda is 
remarkably capable of explaining spectra otherwise uniquely 
identified by either Mascot or SEQUEST. 
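
The pairwise percentages in this comparison are shares of one engine's PSMs that another engine reports as well. A minimal sketch of that bookkeeping follows, assuming each PSM is keyed by a (spectrum ID, peptide sequence) pair. The key layout, the function name, and the toy data are illustrative assumptions and not the scripts used for Figure 4.

```python
def shared_fraction(reference, other):
    """Fraction of the PSMs in `reference` that `other` reports as well."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(reference & other) / len(reference)


# Toy PSM sets keyed by (spectrum ID, peptide sequence); not real data.
amanda = {(1, "PEPTIDEK"), (2, "SAMPLER"), (3, "ELVISK"), (4, "LIVESK")}
mascot = {(1, "PEPTIDEK"), (2, "SAMPLER"), (5, "GREATK")}
sequest = {(1, "PEPTIDEK"), (3, "ELVISK"), (5, "GREATK")}

# PSMs reported by all three engines (the centre of a three-set Venn diagram).
common_to_all = amanda & mascot & sequest

# Share of Mascot PSMs recovered by each of the other engines.
mascot_in_amanda = shared_fraction(mascot, amanda)
mascot_in_sequest = shared_fraction(mascot, sequest)
```

Note that the fraction is not symmetric: the share of Mascot PSMs found by MS Amanda generally differs from the share of MS Amanda PSMs found by Mascot, which is why the text reports each direction separately.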


DISCUSSION 


Current state-of-the-art mass spectrometers provide highly 
accurate m/z data of both intact peptides and fragment ions. 
These instruments were not readily available at the time when 
Mascot and SEQUEST were developed. Still, Mascot and 
SEQUEST are among the most widely used search engines and 
perform generally well for both low- and high-resolution data. 
Here we present MS Amanda, a peptide identification 
algorithm shown to outperform these established search 
engines on examined data sets. 

MS Amanda is based on a cumulative binomial distribution 
function, which estimates the probability to match n out of N 
peaks by chance. In our implementation of the cumulative 
distribution function, N is the number of picked peaks, and n 
the number of matching peaks (formula 1). We consider this 
strategy beneficial for spectra where the number of theoretical 
fragment ions is large (e.g., for spectra with many different 
types of neutral loss peaks). In addition, our estimation of the 
probability p to match one peak by chance (formula 3) 
provides the advantage that fragment ion tolerances can be 
specified in parts per million. Further, our scoring system 
considers the intensities of all matched peaks for reporting the 
score of each potential peptide spectrum match. 
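
As an illustration of the binomial idea, the sketch below computes the probability of matching at least n of N picked peaks by chance, given a single-peak chance probability p. Reading "cumulative" as the upper tail of the distribution and converting the probability with -10*log10 are assumptions made here for illustration; the intensity weighting mentioned in the text is left out, and the function names are hypothetical.

```python
from math import comb, log10


def chance_match_probability(n_matched, n_picked, p):
    """Upper binomial tail P(X >= n_matched) for X ~ Binomial(n_picked, p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0 <= n_matched <= n_picked:
        raise ValueError("n_matched must lie in [0, n_picked]")
    return sum(
        comb(n_picked, k) * p**k * (1.0 - p) ** (n_picked - k)
        for k in range(n_matched, n_picked + 1)
    )


def log_score(n_matched, n_picked, p):
    """Assumed transform -10*log10(P); intensity weighting is omitted."""
    return -10.0 * log10(chance_match_probability(n_matched, n_picked, p))
```

The more peaks match for a fixed N and p, the smaller the chance probability and the larger the resulting score, which is the behavior a match score needs.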

We found that MS Amanda provides an increased peptide 
identification performance in comparison to the well-established 
search engines Mascot and SEQUEST, as highlighted 
both for HCD and ETD data sets (increase in PSMs 
between 11 and 22% on the HCD HeLa set). The number of 
detected PSMs in a data set correlates with the number of 
unique peptides. More identified PSMs lead to potentially more 
identified peptides, which subsequently influences protein 
scoring and potentially increases the number of identified 
proteins. While MS Amanda uniquely identified many addi- 
tional PSMs, our search engine further incorporates large 
fractions of PSMs otherwise uniquely reported by either 
Mascot or SEQUEST. 

We suggest that MS Amanda is particularly well suited for 
high-resolution data sets, as we observed a substantial performance 
gain for HCD and high mass accuracy ETD data. In addition, 
by showing small but consistent improvements for CID data, 
we further highlight its general applicability. We want to 
emphasize the performance of MS Amanda on our 
modification-rich histone data set, where we observed a 24–56% 
increase in identified PSMs. One possible explanation for this 
gain is that MS Amanda is particularly well suited for the 
identification of peptides of large mass and higher charge 
state (charge states +4 to +8 constitute almost middle-down 
data). 

With its remarkably consistent performance and its availability as a 
downloadable version (standalone and integrated in PD), we 
believe that our ready-to-use implementation is of particular 
value for the proteomics community. MS Amanda is available at 
http://ms.imp.ac.at/?goto=msamanda. 


Supplemental Table S1. Identified target peptides at 1% FDR. 
A dash marks cells for which no value is listed. 

                         Mascot    SEQUEST    MS Amanda 
HCD HeLa Replicate 1      8,924      9,417       10,841 
HCD HeLa Replicate 2     10,025     10,909       12,101 
HCD HeLa Replicate 3     10,119     10,530       11,773 
HCD PepLib Phospho       26,495     22,976       27,414 
HCD PepLib No Phospho    31,835     34,861       35,418 
ETD PepLib Phospho        6,196          -        6,890 
ETD PepLib No Phospho     6,368          -        8,018 
CID HeLa Replicate 1      5,288          -        5,442 
CID HeLa Replicate 2      5,342          -        5,642 
CID HeLa Replicate 3      5,576          -        5,746 


Supplemental Table S2. Identified target PSMs at 1% FDR. 
A dash marks cells for which no value is listed. 

                         Mascot    SEQUEST    MS Amanda 
HCD HeLa Replicate 1     12,386     12,858       15,091 
HCD HeLa Replicate 2     13,761     14,628       16,305 
HCD HeLa Replicate 3     13,537     14,223       15,984 
HCD Histone                 620        778          969 
HCD PepLib Phospho       72,771     62,031       75,605 
HCD PepLib No Phospho   103,999    112,491      118,491 
ETD PepLib Phospho       12,016          -       12,979 
ETD PepLib No Phospho    13,489          -       16,400 
CID HeLa Replicate 1      6,895          -        7,046 
CID HeLa Replicate 2      7,093          -        7,409 
CID HeLa Replicate 3      7,091          -        7,373 

Supplemental Table S3. Performance comparison of MS Amanda and Morpheus 
on the HCD HeLa data set at 1% FDR. Search settings and modifications were as 
described above: 7 ppm precursor mass tolerance, 0.03 Da fragment mass 
tolerance, Oxidation (M) as variable modification, Carbamidomethyl (C) as fixed 
modification, and trypsin as enzyme allowing up to two missed cleavages. 


Morpheus MS Amanda 
HCD HeLa Replicate 1 12,060 15,091 
HCD HeLa Replicate 2 12,019 16,305 
HCD HeLa Replicate 3 11,068 15,984 




Supplemental Figure S1. Schematic view of the overlap calculation for fragment ions 
given mass tolerance t (see Formula 4). Each peak covers an m/z range of 2*t, 
leading to potential overlaps of the covered range for nearby peaks. The probability 
to match a peak by chance is given by the ratio of the covered range to the total m/z 
range. To calculate the covered range, overlapping areas (marked areas in the 
figure) are subtracted from the sum of all peak ranges. 
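
The covered-range calculation in this caption amounts to taking the length of a union of intervals. The following is a minimal sketch under stated assumptions: the tolerance t is an absolute m/z value (a ppm tolerance would make t depend on each peak's m/z), windows are not clipped at the edges of the total range, and the function names are hypothetical rather than taken from MS Amanda.

```python
def covered_range(peaks_mz, t):
    """Total m/z range covered by windows [mz - t, mz + t], overlaps counted once."""
    total = 0.0
    current_lo = current_hi = None
    for mz in sorted(peaks_mz):
        lo, hi = mz - t, mz + t
        if current_hi is None or lo > current_hi:
            # Window is disjoint from the running one: close it and start anew.
            if current_hi is not None:
                total += current_hi - current_lo
            current_lo, current_hi = lo, hi
        else:
            # Window overlaps the running one: extend it instead of adding twice.
            current_hi = max(current_hi, hi)
    if current_hi is not None:
        total += current_hi - current_lo
    return total


def single_peak_chance_probability(peaks_mz, t, mz_min, mz_max):
    """Fraction of the total m/z range [mz_min, mz_max] covered by peak windows."""
    return covered_range(peaks_mz, t) / (mz_max - mz_min)
```

Merging sorted windows gives the same result as summing all 2*t ranges and subtracting the overlapping areas, but avoids double counting when more than two windows overlap.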


Chapter 4 


CharmeRT: Boosting 
Peptide Identifications by 
Chimeric Spectra 
Identification and Retention 
Time Prediction 


This chapter deals with the identification and validation of co-eluting 
precursors in tandem mass spectra, so-called chimeric or mixed spectra. This 
approach has been published in the Journal of Proteome Research, 2018, 
where it is shown that the identification and validation of chimeric spectra 
leads to increased numbers of unique peptides and proteins, information 
that is readily available in the measured data sets [23]. 

Reprinted with permission from Dorfer, V.; Maltsev, S.; Winkler, S.; 
Mechtler, K. CharmeRT: Boosting Peptide Identifications by Chimeric Spec- 
tra Identification and Retention Time Prediction. J. Proteome Res. 2018, 
17 (8), 2581-2589. 
https://pubs.acs.org/articlesonrequest/AOR-rQDdFpzevxqMWDkueMPv. 
Copyright 2018 American Chemical Society. 


ABSTRACT: Coeluting peptides are still a major challenge for the 
identification and validation of MS/MS spectra, but carry great 
potential. To tackle these problems, we have developed the 
CharmeRT workflow presented here, combining a chimeric spectra 
identification strategy implemented as part of the MS Amanda 
algorithm with the validation system Elutator, which incorporates a 
highly accurate retention time prediction algorithm. For high-resolution 
data sets this workflow identifies 38–64% chimeric spectra, 
which results in up to 63% more unique peptides compared to a 
conventional single search strategy. 

[Abstract graphic: RNA expression intensities for HeLa proteins. Probability density over log10(Expression) for all HeLa expressed proteins (RNA gene data), proteins identified in the first search, and proteins identified exclusively in the second search.] 

KEYWORDS: tandem mass spectrometry, MS/MS, database search, chimeric spectra, mixed spectra, retention time prediction, 
validation 


INTRODUCTION 


Advancements in mass spectrometer instrument precision and 
acquisition time made mass spectrometry the primary 
instrument in proteomics analyses. The interpretation of the 
measured spectra is often performed using a database search 
algorithm. Most database search algorithms stick to the 
"one-spectrum-one-peptide" paradigm, although the occurrence 
of coeluting peptides and the accompanying challenges of 
chimeric spectra have been widely studied. Even though 
several solutions for processing chimeric spectra already 
exist, they are still often not used in an everyday 
proteomics workflow. In addition, the validation of more than 
one peptide match per spectrum (here called mPSM) is an 
important task, as the confidence score for the most abundant 
peptide in a spectrum is not easily comparable to the score of a 
second coeluting peptide also present in the spectrum. 
However, by ignoring this valuable information a large 
number of unique peptides remain unidentified, as recent 
studies show that about 50% of all spectra contain more than 
one peptide. 

In general, the dynamic range of proteins is a big challenge in 
proteomics experiments. Detecting highly abundant proteins 
is a lot simpler than identifying the least abundant part of the 
proteome. Many approaches have been conducted to 
increase proteome coverage and enable deep proteome 
analysis, being more or less straightforward and affordable 
techniques for an everyday proteomics workflow. 

We here propose a combination of identifying chimeric 
spectra and validating detected mPSMs using retention time 
prediction, jointly leading to a significant increase in validated 
unique peptides for each data set accompanied by higher 
coverage of low abundant proteins: the CharmeRT workflow. 

Received: November 22, 2017 
Published: June 4, 2018 


METHODS 


CharmeRT Workflow 

The first part of the CharmeRT workflow identifies chimeric 
spectra using a second search approach in our database search 
engine MS Amanda. The second part of CharmeRT validates 
the identified PSMs of first and second searches using Elutator, 
a newly developed tool based on the principles of Percolator, 
featuring a new approach for retention time prediction. An 
overview of the workflow can be seen in Figure 1. 

Figure 1. Overview of the CharmeRT workflow. After a first search 
round with MS Amanda, spectra are cleaned, potential interfering 
precursors are identified, and spectra are submitted to a second search 
round. Resulting PSMs of the first and the second search are validated 
by Elutator using a retention time model. 

Chimeric Spectra Search in MS Amanda 

To identify multiple peptides per spectrum, a second search 
approach was implemented in the database search engine MS 
Amanda. For each spectrum, all peaks of the highest scoring 
peptide identified in the first search are removed. The removal 
is based on the fragment ion types selected for the search; 
neutral loss ions can additionally be removed. Because 
interfering peptides may share the same C- or N-terminal amino 
acid due to the enzyme used, which leads to a shared y1/b1 ion in a 
mixed spectrum, y1 ions can optionally be kept; b1 ions are 
not considered at all by MS Amanda. Tests showed that all 
other potentially shared peaks can be neglected, as they are very 
unlikely. We identified an average overlap of 0.7%, see 
Supplemental Table S2. Corresponding MS1 spectra are 
investigated and potential interfering precursors are determined, 
optionally performing a preceding deisotoping of the 
MS1 spectrum. There are several ways to treat precursor peaks 
where the charge state cannot be determined: not considering 
them, testing various selectable charge states, or only testing the 
most abundant ones of them at different charge states. All 
spectra are submitted to a further search round testing each of the 
identified precursors, with the option to search the original 
precursor again. For each spectrum, multiple second search hits, i.e., 
the best n PSMs for the top m precursors, are reported. 
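
The spectrum-cleaning step before the second search can be pictured as a filter over the peak list. The sketch below is a simplified illustration and not MS Amanda code: the function name, the (m/z, intensity) peak layout, and the absolute tolerance are assumptions, and the m/z value used in the usage example is the well-known singly charged y1 ion of a C-terminal lysine (about 147.1128).

```python
def remove_explained_peaks(peaks, matched_mz, tol_da=0.02, keep_mz=()):
    """Drop peaks explained by the first-search peptide.

    peaks      -- list of (mz, intensity) tuples of the MS/MS spectrum
    matched_mz -- theoretical fragment m/z values of the first-search hit
    keep_mz    -- m/z values that stay in the spectrum even if matched,
                  e.g. a y1 ion that a coeluting peptide may share
    """
    def near(mz, targets):
        return any(abs(mz - target) <= tol_da for target in targets)

    return [
        (mz, intensity)
        for mz, intensity in peaks
        if not near(mz, matched_mz) or near(mz, keep_mz)
    ]


# Usage with toy values: keep the y1 ion of a C-terminal lysine.
Y1_LYSINE = 147.1128
spectrum = [(147.1128, 50.0), (300.2, 10.0), (450.3, 20.0)]
first_hit_fragments = [147.1128, 300.2]
cleaned = remove_explained_peaks(spectrum, first_hit_fragments, keep_mz=(Y1_LYSINE,))
```

Keeping y1 reflects the option described in the text: two tryptic peptides ending in the same residue produce the same y1 ion, so removing it would take evidence away from the second peptide.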


mPSM Validation in Elutator 


The second part of the CharmeRT workflow is realized by 
Elutator, a new tool for validating identified mPSMs. Elutator is 
based on the principles of Percolator and validates mPSMs 
using a set of features optimized for the analysis of MS Amanda 
results. A complete list of all used features is given in the 
supplemental data (Supplemental Table S1), including the 
deviation of an estimated peptide elution retention time (RT) 
from the actual value, as well as recalibrated masses for 
precursor and fragment ions. The most important features are 
explained in the next sections. 

Elutator Retention Time Prediction Model. An 
important factor in the context of validating mPSMs is the 
difference between predicted and measured retention times. 
Several approaches already exist to construct RT prediction 
models. However, the use of these models for validation is 
often limited due to specific requirements, such as a significant 
amount of training data and correct handling of chemical 
modifications. We have therefore developed a new retention 
time prediction algorithm: Elutator's RT model is based on the 
SSRCalc model and estimates the hydrophobicity index of 
peptides based on their sequences and chemical modifications, 
which can be linearly mapped to retention time. It was 
significantly redesigned and extended for better performance 
but preserves most of the features and ideas of the original 
SSRCalc algorithm. The features used by the model 
include peptide length, certain properties for special amino 
acids (e.g., proline), the isoelectric charge, properties for short 
peptides, and parameters for hydrophobic amino acid patterns 
likely forming helices, and are similar to the features described 
by Krokhin. 

An important improvement compared to the original 
SSRCalc model for retention time prediction is the consideration of 
neighboring effects of amino acid residues that are not restricted to 
nearest neighbors only. Experiments showed a statistically 
significant effect of residue interactions even for 
residues separated by several positions in the polypeptide 
chain. A detailed description of how we model these 
interactions is given in the Supporting Information. 

The described features are used in an optimized nonlinear 
retention time model implemented in Elutator. The original 
formulation of the model was given by Krokhin for SSRCalc. 
The parameters (coefficients) of the model are optimized 
using Newton's method to minimize the sum of squares of 
retention time deviations for all peptides in the training sets. 
Detailed information on the model calculation is given in the 
supplemental data. The optimization procedure assumes 
simultaneous training over several different data sets measured 
under similar elution conditions (gradient duration, chemical 
composition of eluents, column temperature, etc.). To avoid 
overfitting, the retention time model has been trained using 
94 122 highly reliable PSMs (FDR threshold was 0.001) 
corresponding to 44 271 unique sequences obtained from 
in-house measured data sets of different organisms: trypsin 
digested human (HeLa), mouse, yeast, B. subtilis, E. coli, 
phosphorylated peptides from TiO2 enriched human cell lysate, 
and a chymotrypsin digested human data set. After a preliminary 
optimization, we removed 0.1% of the outliers, corresponding 
to the number of expected false matches, and repeated the 
optimization. By considering additional peptide properties, 
such as the interactions of neighboring amino acid residues in the 
peptide chain, we considerably increased the RT prediction 
accuracy (Figure 2). A similar accuracy of retention time 
prediction was achieved for phosphorylated and unmodified 
peptides (see Supplemental Figure S1). 

[Figure 2: scatter plots of measured RT (min) against the model output. (a) Elutator, hydrophobicity index: √σ² = 6.29 min, R² = 0.972. (b) BioLCCC online, predicted RT (min): √σ² = 14.4 min, R² = 0.851. (c) SSRCalc online, hydrophobicity index: √σ² = 8.87 min, R² = 0.944. (d) Elude, testing subset, predicted RT (min): √σ² = 8.46 min, R² = 0.949.] 

Figure 2. Comparison of elution retention time prediction models: (a) Elutator, (b) BioLCCC, (c) SSRCalc, and (d) Elude. Depending upon 
the model design the output is either an absolute retention time or a relative hydrophobicity index, which can be linearly mapped to the retention 
time in a particular data set. We here compare the correlation of predicted and measured retention times of data set I, which is important for 
validation. R² is the coefficient of determination, and √σ² is the dispersion of the error in minutes. As Elude cannot be trained on multiple raw files, 
we here used 50% randomly chosen PSMs over all raw files for training and the others for testing. 


DOI: 10.1021/acs.jproteome.7b00836 
J. Proteome Res. 2018, 17, 2581-2589 


For practical usage, the applicability of the trained model to 
data sets measured under a different chromatographic setup is 
of high interest. Elutator maps the predicted hydrophobicity 
index to the observed retention time by applying a linear fit 
for all peptides in a single HPLC run. This allows for an 
application to data sets with different setups. We investigated 
this using a publicly available externally measured HeLa data 
set. The accuracy of the retention time prediction is lower for 
the external data set, as can be seen from the R² value. 
Nevertheless, as demonstrated in Supplemental 
Figure S2, using retention time prediction also here leads to a 
higher number of PSMs. The smaller retention time dispersion for 
the external data set can be explained by the shorter gradient 
(90 min versus 180 min for the in-house data set). The shorter 
gradient duration leads to a proportional decrease of retention 
time deviations. Alternatively, a new model can be easily trained 
for specific elution conditions using the Elutator RT Trainer 
(see Availability). 
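
The per-run linear mapping from hydrophobicity index to retention time, together with the two quality measures reported in Figure 2, can be sketched with an ordinary least-squares line. This is an illustrative sketch only: the function names are hypothetical, and taking the dispersion as the root mean squared residual is an assumption about how √σ² is computed.

```python
def fit_hi_to_rt(hydrophobicity, retention_time):
    """Least-squares line RT = slope * HI + intercept for one HPLC run."""
    n = len(hydrophobicity)
    mean_x = sum(hydrophobicity) / n
    mean_y = sum(retention_time) / n
    sxx = sum((x - mean_x) ** 2 for x in hydrophobicity)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(hydrophobicity, retention_time)
    )
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_quality(hydrophobicity, retention_time, slope, intercept):
    """Return (R^2, dispersion), dispersion being the RMS residual in RT units."""
    residuals = [
        y - (slope * x + intercept)
        for x, y in zip(hydrophobicity, retention_time)
    ]
    ss_res = sum(r * r for r in residuals)
    mean_y = sum(retention_time) / len(retention_time)
    ss_tot = sum((y - mean_y) ** 2 for y in retention_time)
    return 1.0 - ss_res / ss_tot, (ss_res / len(residuals)) ** 0.5
```

Because slope and intercept are refitted per run, a model trained under one chromatographic setup can be reused on another as long as the elution order is preserved, which is the point made in the text.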

Combined Retention Time Score. Besides the deviation 
of the predicted RT from the measured RT, Elutator also uses a 
combined retention time score as a feature for mPSM validation. 
It includes the PSM score of the search engine and the 
retention time deviation obtained from the retention time 
model. To calculate the combined score, the MS Amanda score is 
first recalibrated on the posterior error, using linear regression to 
define coefficients a and b in the model 

$$-10 \log(f(A)) \approx aA + b$$

where f(A) is the probability for a match with score A to be 
false (i.e., local FDR), and A is the MS Amanda score. 

After this calibration, the combined score is calculated using 
the following scoring function: 

$$S_{\mathrm{combined}} = aA + b + \max\left(10 \log\left(\,\ldots\,\right),\ 0\right)$$

where the argument of the logarithm is expressed through T, σ, and erf(ε): 
σ is the dispersion of the predicted retention time, 
calculated considering highly reliable matches (FDR = 0.001), 
T is the duration of the linear part of the gradient, erf is the 
Gauss error function, and ε is defined in terms of |Δt|, where Δt is 
the retention time deviation from the predicted value for the 
scored peptide. 


Figure 3. Comparison of identification results of HeLa data sets measured with various isolation widths and gradient times analyzed with the 
CharmeRT workflow. We analyzed triplicates of tryptic HeLa samples for 2 m/z, 4 m/z, and 8 m/z isolation width, each either at a gradient time of 1 
h or 3 h. Results are given for 1% FDR calculated at peptide level, showing the (a) number of identified PSMs in the first and in the second search 
and (b) number of unique peptides identified only in the first, only in the second, and in both searches. 

Calibration of Mass Differences. The aim of calibrating 
mass differences is to eliminate constant biases in mass 
measurements for precursors and fragments to enhance the 
mass resolution; the calibrated values are included as an 
additional feature for mPSM validation. In Elutator this 
calibration is based on theoretically 
known masses of highly reliable matches of the first search 
(FDR = 0.001, calculated on MS Amanda score). 

Recalibration can be done for measured deviations of m/z 
values, $\Delta(m/z)$, as well as for relative mass deviations, 
$\Delta m_{\mathrm{rel}}$. 
Elutator uses the following approximation of mass deviations 
over retention time t and m/z to determine the calibration 
coefficients a, b, and c: 

$$\Delta(m/z) \approx a_1 \times t + b_1 \times (m/z) + c_1$$

$$\Delta m_{\mathrm{rel}} \approx a_2 \times t + b_2 \times (m/z) + c_2$$

Results of mass recalibration for a human data set are 
presented in Supplemental Figure S3. This data set was 
analyzed with lock mass disabled (available in Q Exactive 
instruments, Thermo Fisher Scientific). Constant bias and 
variable error seemed to be similar in this case. Activating the 
lock mass option partly eliminates a constant bias, but increases 
a variable error because it is based on measuring the mass of 
known ions present in the spectrum. Therefore, we suggest that 
disabling the lock mass is preferable for better mass resolution 
when PSM validation by Elutator is used. 
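
A fit of the form deviation ≈ a × t + b × (m/z) + c can be obtained by ordinary least squares over the highly reliable matches. The sketch below solves the 3 × 3 normal equations in plain Python; it is an assumption-laden illustration (hypothetical function names, no outlier handling), not Elutator's implementation.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_mass_bias(rt, mz, deviation):
    """Least-squares coefficients (a, b, c) of deviation ~ a*rt + b*mz + c."""
    rows = [(t, z, 1.0) for t, z in zip(rt, mz)]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * d for r, d in zip(rows, deviation)) for i in range(3)]
    return tuple(solve_linear(ata, atb))


def recalibrate(deviation, t, z, coefficients):
    """Subtract the fitted systematic bias from one measured deviation."""
    a, b, c = coefficients
    return deviation - (a * t + b * z + c)
```

After recalibration only the variable part of the mass error remains, which is the quantity that helps separate correct from incorrect matches.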

Longest Consecutive Series A + B + Y. We introduce a 
combined consecutive sequence of N- and C-terminal ions as an 
additional feature for validation, namely the sequence of a, b, 
and y ions, which typically constitute HCD/CID spectra. PSMs 
with scores close to the FDR threshold contain relatively few 
matched fragment peaks; therefore, y ions alone are likely unable to 
form any consecutive sequence. However, longer sequences can 
potentially be constructed by taking into account a and b ions, 
which fill gaps between y fragments (see Supplemental Figure 
S4). 
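
One way to read this feature is as the longest run of consecutive backbone cleavage sites that are supported by at least one matched a, b, or y ion. The sketch below follows that reading, which is an assumption, since the exact definition is given only in Supplemental Figure S4. For a peptide of length L, an a or b ion with index i supports cleavage site i, and a y ion with index j supports site L - j.

```python
def longest_consecutive_series(peptide_length, a_ions=(), b_ions=(), y_ions=()):
    """Longest run of consecutive cleavage sites covered by matched a/b/y ions.

    Ion arguments are collections of matched ion indices (e.g. y_ions={2, 3}
    means y2 and y3 were matched). Cleavage sites run from 1 to L - 1.
    """
    covered = set(a_ions) | set(b_ions)
    covered |= {peptide_length - j for j in y_ions}
    best = run = 0
    for site in range(1, peptide_length):
        run = run + 1 if site in covered else 0
        best = max(best, run)
    return best
```

For a 10-residue peptide with matched y2, y3, and y6, the y ions alone give a run of 2; adding matched b5 and b6 bridges the gap and extends the run to 5.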


EXPERIMENTS 


In-House Data Generation 


Samples were reduced and alkylated using dithiothreitol (1 µg 
DTT per 20 µg protein) and iodoacetamide (5 µg per 20 µg 
protein). Proteins were predigested with Lys-C at 30 °C for 2 h 
(1 µg Lys-C per 50 µg protein in 6 M urea and 12 mM 
triethylammonium bicarbonate buffer (100 mM ammonium 
bicarbonate (ABC) buffer for mouse samples)) and digested 
overnight with trypsin (Promega, Trypsin Gold, Mass 
spectrometry grade) at 37 °C (1 µg trypsin per 30 µg protein, 
0.8 M urea in 45 mM triethylammonium bicarbonate buffer 
(mouse: 2 M urea with 100 mM ABC buffer)); digestion was 
stopped by adding concentrated TFA to a pH of approximately 
2. Phosphorylated peptides were enriched following the in-house 
TiO2 enrichment protocol. HeLa peptides were 
obtained following the in-house HeLa protocol. 

The HPLC system used was an UltiMate 3000 HPLC RSLC 
nano system coupled to a Q Exactive mass spectrometer 
(Thermo Fisher Scientific, Bremen, Germany), equipped with a 
Proxeon nanospray source (Proxeon, Odense, Denmark). 
Peptides were loaded onto a trap column (Thermo Fisher 
Scientific, Bremen, Germany, PepMap C18, 5 mm × 300 µm 
ID, 5 µm particles, 100 Å pore size) at a flow rate of 25 µL/min 
using 0.1% TFA as mobile phase. After 10 minutes the trap 
column was switched in line with the analytical column 
(Thermo Fisher Scientific, Bremen, Germany, PepMap C18, 
500 mm × 75 µm ID, 3 µm, 100 Å). Peptides were eluted using 
a flow rate of 230 nL/min. The eluting peptides were directly 
analyzed using hybrid quadrupole-orbitrap mass spectrometers 
(Q Exactive or Q Exactive Hybrid, Thermo Fisher). The Q 
Exactive mass spectrometer was operated in data-dependent 
mode using a full scan (m/z range 350–1650 Th, nominal 
resolution of 70 000, target value 1E6) followed by MS/MS 
scans of the 12 most abundant ions. MS/MS spectra were 
acquired at a resolution of 17 500 using normalized collision 
energy 30%, isolation widths of 2, 4, or 8 m/z, and the target value 
was set to 5E4. Precursor ions selected for fragmentation 
(charge state 2 and higher) were put on a dynamic exclusion list 
for 10 s. Additionally, the underfill ratio was set to 20%, 
resulting in an intensity threshold of 2E4. 


Figure 4. Comparison of MS Amanda and Elutator with other scoring methods and validation tools. Comparison was performed using (a) an 
external HeLa data set obtained from Michalski et al. (data set H) and (b) an in-house data set of human HeLa after TiO2 enrichment of 
phosphorylated peptides (data set G). The FDR threshold of 1% was calculated at PSM level for consistency between different search tools, which 
typically operate at PSM level. In cases where several high confident matches were reported for the same spectrum, the match with best q-value was 
selected such that the number of PSMs corresponds to the number of confidently identified spectra. Elutator includes features derived from a peptide 
elution retention time prediction model. Model training was performed on in-house data sets; the same model was applied to in-house and external 
data sets. 

Data Set Description 


To assess the quality of the CharmeRT workflow, we applied it 
to several different data sets (3 replicates each, measured on 
Thermo Q Exactive or Q Exactive Hybrid): several in-house 
HeLa tryptic digests with different isolation widths and 
different gradient times (data sets A–F, I), an in-house 
phospho-enriched HeLa tryptic digest (data set G), and an 
external HeLa tryptic digest (data set H). 


(A, B) HeLa tryptic digest, in-house measurement (Thermo Q 
Exactive Hybrid, 1 h gradient (A) and 3 h gradient (B), 
2 m/z isolation width, 1 µg, Figure 3). 

(C, D) HeLa tryptic digest, in-house measurement (Thermo Q 
Exactive Hybrid, 1 h gradient (C) and 3 h gradient (D), 
4 m/z isolation width, 1 µg, Figure 3). 

(E, F) HeLa tryptic digest, in-house measurement (Thermo Q 
Exactive Hybrid, 1 h gradient (E) and 3 h gradient (F), 
8 m/z isolation width, 1 µg, Figure 3). 

(G) HeLa tryptic digest, in-house measurement, phospho 
enrichment (Thermo Q Exactive, 3 h gradient, 2 m/z 
isolation width, 100 ng, Figure 4 and Figure S1, TiO2 
enrichment of phosphorylated peptides). 

(H) HeLa tryptic digest, external measurement (Thermo 
Q Exactive, 90 min gradient, 4 m/z isolation width, 5 
µg, Figure 4 and Figure S2). 

(I) HeLa tryptic digest, in-house measurement (Thermo Q 
Exactive, 3 h gradient, 2 m/z isolation width, 100 ng, 
Figures 2 and S2). 


Database Search Parameters 


When possible, runs have been performed in Proteome 
Discoverer 1.4, using Mascot version 2.2.7, MS Amanda v 
1.4.14.9288, and Elutator v 1.14.1.236. For results obtained 
with pParse, all raw files have been preprocessed with pParse 
version 2.0.8 and resulting files submitted to PD 1.4. MaxQuant 
results were obtained with version 1.5.5.1, and all settings were 
set to default values as this led to the best performance. 

The following parameter settings have been used for MS 
Amanda, Mascot, and MaxQuant: Swiss-Prot database 2016-06 
(human/mouse) including the "cRAP" contaminants database; 
trypsin as enzyme; 2 missed cleavages; Carbamidomethyl(C) as 
fixed PTM; Oxidation(M) and (for the phosphorylated data 
set) Phospho(S,T) as variable modifications. For MS Amanda 
and Mascot 10 ppm precursor mass tolerance and 0.02 Da 
fragment mass tolerance were used. 

We applied the following additional settings specific for MS 
Amanda, where second search has been enabled: MS1 spectrum 
deisotoping set to false; keep y1 ion, remove water losses, 
remove ammonia losses, and exclude first precursor set to true; 
top 5 results per precursor in Figures 3 and 4/top 10 results per 
precursor for Supplemental Figure S7. 

For Mascot we set the peptide cutoff score to 0. 

Figure 5. Comparison of protein expression values. Proteins identified in the second search (red) correspond in a higher proportion to low abundant 
proteins compared to proteins already identified in the first search (blue). Overall expression values for HeLa cells (gray) have been taken from 
ProteinAtlas. 


The Elutator FDR threshold was set to 1% on peptide level 
for results in Figure 3 and on PSM level for the experiments in 
Figure 4. For results in Figure 4, the match with the best q-value 
was selected when several high confident 
matches were reported for the same spectrum, such that the 
number of PSMs corresponds to the number of confidently 
identified spectra. For all results obtained using Percolator, 
numbers were obtained applying an extra Proteome Discoverer 
node "Multi-confident PSMs fix", available at 
http://ms.imp.ac.at/?goto=charmert. MaxQuant results were filtered manually. 
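
The selection rule used for Figure 4, one confident match per spectrum chosen by best q-value, can be sketched as follows. The tuple layout, the function name, and the toy values are illustrative assumptions; this is not the Proteome Discoverer node mentioned above.

```python
def best_match_per_spectrum(matches, q_threshold=0.01):
    """Keep, per spectrum, the confident match with the lowest q-value.

    matches -- iterable of (spectrum_id, peptide, q_value) tuples
    Returns a dict mapping spectrum_id to (peptide, q_value).
    """
    best = {}
    for spectrum_id, peptide, q_value in matches:
        if q_value > q_threshold:
            continue  # not confident at the chosen FDR threshold
        if spectrum_id not in best or q_value < best[spectrum_id][1]:
            best[spectrum_id] = (peptide, q_value)
    return best


# Toy input: spectrum 1 has two confident matches, spectrum 2 has none.
toy_matches = [
    (1, "AAK", 0.001),
    (1, "CCR", 0.004),
    (2, "DDK", 0.020),
    (3, "EEK", 0.009),
]
confident = best_match_per_spectrum(toy_matches)
n_identified_spectra = len(confident)
```

Counting dictionary entries rather than matches is what makes the reported number of PSMs equal to the number of confidently identified spectra, so that tools reporting several matches per spectrum are compared fairly with tools reporting one.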


RESULTS 


CharmeRT Performance 


To demonstrate the performance of the CharmeRT workflow, 
we analyzed HeLa samples using different isolation widths 
during acquisition. In standard mass spectrometry experiments, 
very narrow isolation widths (<2 m/z) are applied to decrease 
the probability of coeluting peptides. However, being able to 
reliably identify multiple coeluting peptides per spectrum reveals 
new possibilities for peptide identification and acquisition. By 
using broader isolation widths, we were able to considerably 
increase the numbers of identified peptides at a constant FDR 
(Figure 3). 

Applying the second search approach increased the number 
of reliable identifications for all tested isolation widths and 
gradient times. Even for narrow isolation widths (2 m/z) and 
short gradient times (1 h) we observed a considerable number 
of validated chimeric spectra, which increased the number of 
identified unique peptides by 41% (5360 unique peptides). As 
expected, the number of reliably identified PSMs and peptides 
in the first search decreases by 2–15% for broad isolation 
widths (8 m/z, 14 219 PSMs (1 h)/23 138 PSMs (3 h)) 
compared to narrow isolation widths (2 m/z, 14 506 PSMs (1 
h)/27 340 PSMs (3 h)), as spectra complexity increases. This is 
alleviated by the chimeric approach, which identified almost the 
same number of unique peptides (20 438 (1 h)/28 550 (3 h) 
unique peptides) compared to the 2 m/z isolation width runs 
(18 566 (1 h)/31 346 (3 h) unique peptides). In our tests an 
isolation width of 4 m/z combined with a longer gradient 
resulted in the highest number of identified peptides (33 138 
unique peptides) and the deepest insight into the investigated 
sample. This results not only in further evidence for already 
identified proteins, but also in additional proteins unidentified 
before (Supplemental Figure S6). Similar results can be 
achieved for an external data set: analyzing label-free data 
acquired at 1.4 m/z isolation width we see an average increase 
in PSMs of 75%, whereas for a TMT data set measured at a 
very narrow isolation width of 0.4 m/z only a small number of 
chimeric spectra can be identified (see Supplemental Figure 
S5). 

On average, 38% of the reliably identified spectra at 2 m/z 
isolation width (1 h gradient) were chimeric spectra 
(Supplemental Figure S7). This number increases to 53% at 
an isolation width of 4 m/z (3 h gradient). Additionally, on 
average, almost 20% of all reliably identified spectra at 4 m/z 
contain more than two peptides. Several examples of randomly 
drawn identified chimeric spectra of data set D are given in 
Supplemental Figures S11-S18. 


Comparison to State of the Art Approaches 


The combination of chimeric spectra identification and mPSM 
validation using the power of accurate retention time prediction 
increased the number of identified PSMs (38373 PSMs 
(HeLa)/5463 PSMs (enriched phospho data set)) by up to 
129% and considerably outperformed all other methods 
(Figure 4, Supplemental Table S3). Compared to the widely 
used combination of Mascot and Percolator (17916 PSMs 
(HeLa)/4088 PSMs (enriched phospho data set)), CharmeRT 
was able to identify 34-114% more PSMs and 25-62% more 
unique peptides. Mascot and Percolator can be additionally 
improved by using pParse, which enables the detection of 
mixed spectra (23 841 PSMs (HeLa)/4488 PSMs (enriched 
phospho data set)). Still, CharmeRT identified 22-61% more 
PSMs than this combination. 
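
The relative gains quoted above follow directly from the PSM counts given in the text. The short sketch below recomputes them; the helper name `percent_increase` is only illustrative and not part of CharmeRT.

```python
def percent_increase(new: int, baseline: int) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# PSM counts quoted in the text, as (HeLa, enriched phospho data set).
charmert = (38373, 5463)
baselines = {
    "Mascot + Percolator": (17916, 4088),
    "pParse + Mascot + Percolator": (23841, 4488),
}

for name, counts in baselines.items():
    gains = [round(percent_increase(c, b)) for c, b in zip(charmert, counts)]
    print(f"CharmeRT vs {name}: +{gains[0]}% (HeLa), +{gains[1]}% (phospho)")
```

The rounded values reproduce the reported ranges of 34-114% and 22-61%.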

Compared to a single search strategy, the CharmeRT 
approach was able to identify 52-90% more PSMs and 23-45% 
more peptides. In addition, 29-36% of all validated 
peptides identified in the first search could be confirmed using 
the second search. The efficacy of Elutator was much higher for 
matches identified in the second search, as the spectrum quality 
for coeluting peptides is lower and therefore the effect of 
including auxiliary information used in Elutator is higher: the 
increase in PSMs was 17-51% for the first search and 106-149% 
for the second search (see Supplemental Table S3 and 
Supplemental Figure S8). The overall positive effect of 
retention time prediction appeared to be 8-15%. Notably, 
the RT prediction model was applied to the externally 
measured data sets without any additional training. 

Only a small number of mixed spectra can be identified 
when the second search approach is used on a phosphorylated 
sample. The validation through Elutator leads to 25% 
additionally identified PSMs in this case for the conventional 
single search compared to Mascot + Percolator. Chemical 
modifications hamper spectrum identification due to an 
increased combinatorial search space. However, only a small 
number of mixed spectra is expected in this case, as the 
enrichment of phosphorylated peptides with, for example, 
titanium dioxide (TiO2) reduces the overall complexity of the 
sample. 

We hypothesized that the additional peptides identified in 
the second search correspond to lower abundant proteins, 
which are typically difficult to identify in standard shotgun 
workflows. If this hypothesis could be confirmed, the 
dynamic range of mass spectrometry measurements could 
effectively be expanded. To validate our assumption, we used 
publicly available RNA expression profiles of HeLa proteins. 
Highly reliable peptides identified in a single raw file (data set D) 
with a global peptide level FDR of 1% from first and second 
search were used to infer 4696 protein groups (Proteome 
Discoverer 1.4, no additional filters). 

For 4435 (94%) proteins, nonzero HeLa RNA expressions 
were found. The remaining proteins mainly correspond to 
contaminant proteins or proteins absent in the RNA expression 
database (Supplemental Table S4). Of the expressed proteins, 
885 (20%) were identified exclusively in the second search. The 
statistical distributions of expression levels of proteins identified 
in the first search and second search strongly indicate that 
activating second search shifts the sensitivity toward lower 
abundant proteins (Figure 5 and Supplemental Figure S9). As 
the correlation between protein and RNA abundance is only 
about 40%, we support this finding by additionally 
analyzing a publicly available spike-in data set (see 
Supplemental Figure S10). 


DISCUSSION 


We have shown that even in experiments with narrow 
isolation widths (2 m/z, 1 h and 3 h gradient) a large number 
of chimeric spectra exists (39%), indicating that coeluting 
peptides are a common issue in tandem mass spectra 
identification. Still, chimeric spectra generally remain unconsidered, 
as standard peptide identification workflows stick to 
the one-peptide-one-spectrum approach. By combining chimeric 
spectra identification and appropriate validation with 
retention time prediction, this challenge can be turned into a 
major opportunity. We are able to identify up to almost three times 
as many PSMs as a standard workflow, leading to 
an increase in identified unique peptides of up to 63% at 1% 
FDR (peptide level). The CharmeRT workflow allows the use 
of wider isolation widths, which enable a deeper insight into 
measured samples. This indicates a possible expansion suitable 
for data-independent measurements (DIA). More importantly, 
CharmeRT increases the proteome coverage at unaltered 
acquisition time, enabling the identification of low abundant 
proteins at no extra cost, except for algorithmic runtime. As 
proteins with regulatory functions often occur at low 
abundance, identifying them is essential for 
understanding and investigating cell mechanisms. By applying 
CharmeRT, we are able to expand the sensitivity range of mass 
spectrum analysis. 


Availability 


CharmeRT is freely available at http://ms.imp.ac.at/?goto=charmert 
for Proteome Discoverer 1.4 and 2.2. A version for 
Proteome Discoverer 2.3 and a standalone version are currently 
in progress and will be available soon. In addition, a tool for 
training RT models on user-specific in-house columns is 
provided. 
Elutator features 

Feature | Description 
MS Amanda Score | The PSM score assigned by the MS Amanda algorithm. 
Delta Score | Difference of the scores between 1st and 2nd rank matches. Nonzero for 1st rank matches only. 
Delta Cn | Normalized score difference relative to the first best scoring PSM of the spectrum. Zero for 1st rank matches and non-zero for rank 2 and above. 
Retention Time [min] | Measured peptide retention time. 
Delta RT [min] | Deviation of the measured retention time (time of spectrum scan) from the predicted. 
Absolute Delta RT [min] | Absolute value of the delta retention time. 
Combined Score | Combined score of the MS Amanda score and retention time deviation. 
% Isolation Interference | Fraction of ion current in the isolation width not attributed to the identified precursor. 
MH+ [Da] | Singly charged mass of the peptide. 
m/z | Measured m/z value. 
Calibrated Delta m/z [Th] | Absolute calibrated deviation of the measured m/z from the theoretical value of the peptide. 
Calibrated Delta Mass [ppm] | Calibrated deviation of the measured mass from the theoretical mass of the peptide in ppm. 
Peptide Length | Length of the peptide in residues as a set of binary flags: length <= 6; length = 7; length = 8; length = 9; length = 10; length >= 11 
Charge State | Precursor charge state, as a set of binary flags: z <= 2; z >= 3 
# Missed Cleavages | Number of missed cleavages. 
Log Peptides Matched | Logarithm of the number of candidates (search space) in the precursor mass window. 
Log Total Intensity | Logarithm of the total ion current of the fragment spectrum. 
Fraction Matched Intensity [%] | Fraction of the total ion current of the fragment spectrum that is matched by fragments of the PSM. 
Log Total Intensity of Fragments | Similar to Log Total Intensity; peaks corresponding to precursor peaks (including isotopes) are excluded. 
Longest Consecutive Series Y | Length of the longest consecutive matched sequence among the y fragment ion series peaks. 
Longest Consecutive Series A+B+Y | Length of the longest consecutive matched sequence confirmed by a, b or y ions. 
Mean Squared Delta m/z for Fragments | Average of squared mass errors of all fragments in Th, calculated as $\frac{1}{n}\sum_{i=1}^{n} \Delta m_i^2$, where $\Delta m_i$ is the mass difference of measured and calculated masses of fragment i, and n is the number of identified fragments. 
Top Y Fragment Delta m/z | Absolute calibrated deviation of the measured m/z from the theoretical value for the top-intense y fragment. y1 and y2 ions are not considered. 
Second Top Y Fragment Delta m/z | Absolute calibrated deviation of the measured m/z from the theoretical value for the second top-intense y fragment. y1 and y2 ions are not considered. 

Table S1. All features used in Elutator to validate PSMs and peptides. 
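
Two of the features listed in Table S1 can be illustrated with a few lines of code. The following is a minimal sketch, not the Elutator implementation: the function names are hypothetical, and it assumes that fragment matching has already been done, so that the inputs are the indices of the matched y ions and the measured and theoretical m/z values of the matched fragments.

```python
from typing import Iterable, Sequence


def longest_consecutive_series(matched_indices: Iterable[int]) -> int:
    """Length of the longest run of consecutive ion indices (e.g. y3, y4, y5)."""
    matched = set(matched_indices)
    best = 0
    for i in matched:
        if i - 1 not in matched:  # i is the first ion of a run
            length = 1
            while i + length in matched:
                length += 1
            best = max(best, length)
    return best


def mean_squared_delta_mz(measured: Sequence[float],
                          theoretical: Sequence[float]) -> float:
    """Average of squared m/z errors over all matched fragments (in Th^2)."""
    if not measured:
        return 0.0  # assumption: no matched fragments yields 0
    errors = [(m - t) ** 2 for m, t in zip(measured, theoretical)]
    return sum(errors) / len(errors)


# Hypothetical example: y1, y2, y3, y5 and y6 were matched.
print(longest_consecutive_series([1, 2, 3, 5, 6]))
print(mean_squared_delta_mz([175.120, 290.148], [175.119, 290.146]))
```

The A+B+Y variant additionally lets a and b ions fill gaps in the y series (see Figure S4) and is not covered by this sketch.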


Retention time prediction model 


Our retention time prediction model can be fully described as the following non-linear, sequence-dependent 
function, which has been described by Krokhin, 2006: 

$$H = F + \mathrm{newIso}(seq, F) + \mathrm{helices1}(seq) + \mathrm{helices2}(seq)$$ 

where H is the hydrophobicity, newIso is a function modeling the isoelectric charge, seq is the peptide 
sequence, helices1 and helices2 are adjustments for short and long helices, and F is defined by: 

$$F = \mathrm{sumScale}(\mathrm{lengthScale}(length) \cdot R)$$ 

with sumScale being a polynomial function over the argument, lengthScale a polynomial factor 
dependent on the length of the peptide, length the length of the peptide sequence and R defined as: 

$$R = G + \mathrm{smallness}\left(\frac{G}{length}\right) - \mathrm{undigested}(sequence) - \mathrm{clusterness}(sequence) - \mathrm{proline}(sequence)$$ 

with smallness being a correction factor depending on the length of the peptide, undigested a function 
to handle special positively charged amino acids (R/H/K), clusterness a function for handling clusters of 
hydrophobic amino acids, decreasing the hydrophobicity, proline a function to handle sequences with 
>=2 prolines in the peptide sequence, and G defined as: 

$$G = \mathrm{baseSumOfRetentionCoefficients}(sequence) + C(sequence)$$ 

with baseSumOfRetentionCoefficients being the sum of the retention time coefficients of all amino 
acid residues of the peptide sequence and C modeling the impact of neighboring amino acids (see below). 


Interactions between neighboring amino acids 


We describe the cumulative contribution C of the interactions between neighboring residues of peptide 
sequence s to the hydrophobicity index as 

$$C(s) = \sum_{i} \sum_{k=0}^{4} \alpha_k \, f(s,i)^k$$ 

summed over all residues i. f(s, i) is defined as 

$$f(s,i) = \sum_{i-9 \le j \le i+9,\; j \ne i} \lambda(j-i)\, \beta(s_i)\, \gamma(s_j)$$ 

The summation over i, j runs through all amino acid residues in the sequence s with a maximal difference 
of ±9 amino acid positions. For each amino acid pair, we consider two coefficients, β for the amino 
acid residue at position i in sequence s and γ for the amino acid residue at position j in sequence s, such that the 
interaction between the residues at positions i and j is described by the product β(s_i)γ(s_j). 
Distance coefficients λ(δ) = λ(j − i) account for the contribution of residue pairs with a distance δ 
between them. All coefficients, including α_k, all λ values, and the values of the lookup tables 
β and γ are optimized during training of the RT model. 
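
The neighbor interaction term can be sketched in code as follows. This is an illustration only, under the reading that C(s) sums, over all residues i, a degree-4 polynomial with coefficients alpha_0..alpha_4 in f(s, i), where f(s, i) sums lambda(j - i) * beta(s_i) * gamma(s_j) over all j within 9 positions of i. The function name and all coefficient values below are hypothetical placeholders; the trained values of the RT model are not reproduced here.

```python
from typing import Mapping, Sequence


def neighbor_interaction(seq: str,
                         alpha: Sequence[float],
                         lam: Mapping[int, float],
                         beta: Mapping[str, float],
                         gamma: Mapping[str, float],
                         max_dist: int = 9) -> float:
    """C(s) = sum_i sum_k alpha[k] * f(s, i)**k, with f as described above."""
    total = 0.0
    for i, res_i in enumerate(seq):
        lo = max(0, i - max_dist)
        hi = min(len(seq) - 1, i + max_dist)
        f = 0.0
        for j in range(lo, hi + 1):
            if j != i:
                # lambda depends on the signed distance j - i
                f += lam[j - i] * beta[res_i] * gamma[seq[j]]
        total += sum(a * f ** k for k, a in enumerate(alpha))
    return total


# Toy coefficients (placeholders, not trained values).
alpha = [0.0, 1.0, 0.0, 0.0, 0.0]            # only the linear term is active
lam = {d: 0.5 for d in range(-9, 10) if d != 0}
beta = {"A": 1.0, "K": 2.0}
gamma = {"A": 3.0, "K": 4.0}
print(neighbor_interaction("AK", alpha, lam, beta, gamma))
```

Keeping the lookup tables as plain mappings mirrors the description that β, γ, and λ are trainable tables rather than fixed formulas.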
Figure S1 Correlation of theoretically calculated hydrophobicity index to the measured retention time for highly 
confident matches (FDR=0.001) of in-house HeLa and TiO2 enriched data sets. 70% of all matches in the TiO2 enriched 
data set contain one or more phosphorylated sites. Outliers were not removed. 

Figure S2 Correlation of theoretically calculated hydrophobicity index to the measured retention time for highly 
confident matches (FDR=0.001) of in-house and external HeLa data set. Outliers were not removed. 

Figure S3 Histogram of mass deviations for highly reliable identifications before and after recalibration, with disabled 
lock mass. External human dataset has been taken from Michalski et al. 

Figure S4 Longest Consecutive Series A+B+Y. a4 and b3 ions confirm the sequence, filling gaps between y5 and y2 ions. The 
length of the Longest Consecutive Series A+B+Y is four in this case. 


Average fragment ion overlap in % between 

Dataset        Replicate   first and second peptides   second peptides 
A (1h, 2m/z)   1           0.7131                      0.4035 
               2           0.7394                      0.4054 
               3           0.6850                      0.4000 
B (3h, 2m/z)   1           0.7182                      0.3799 
               2           0.7311                      0.3917 
               3           0.7137                      0.3821 
C (1h, 4m/z)   1           0.7015                      0.6210 
               2           0.6925                      0.6876 
               3           0.6945                      0.6440 
D (3h, 4m/z)   1           0.6830                      0.6652 
               2           0.6621                      0.6914 
               3           0.6860                      0.7073 
E (1h, 8m/z)   1           0.6510                      0.9318 
               2           0.6346                      0.9333 
               3           0.6232                      0.9497 
F (3h, 8m/z)   1           0.6620                      1.0482 
               2           0.6662                      1.0300 
               3           0.6569                      1.0371 


Table S2 Shared ions between first and second peptides. Overlap of fragment ions given in percent between peptides 
identified in the first and in the second search or between peptides in the second search, when multiple precursors were 
identified in the second search. 
Figure S5 Results for data of O'Connell et al. Data have been analyzed with Proteome Discoverer 1.4 using 10 ppm (a)/50 
ppm (b) precursor mass tolerance and 0.02 Da/0.9 Da fragment mass tolerance. a) Label-free data acquired at 1.4 m/z isolation 
width shows a high number of chimeric spectra that can be identified by CharmeRT. b) TMT data has been measured with an 
isolation window of 0.4 m/z showing a very low number of highly confident interfering peptides. Still, the usage of CharmeRT is 
beneficial in this case as well, as it leads to the identification of more PSMs at the same FDR than the combination of MS 
Amanda and Percolator. 

Figure S6 Protein evidence origin. Proteins from a single HeLa run (4m/z, 3h gradient) are investigated and identified peptides 
are analyzed. The major part of proteins can be confirmed by peptide identifications from both searches, some proteins are 
only found in one of the two search iterations. Protein inference and grouping has been performed with Proteome Discoverer 
1.4. 

Figure S7 Presence of chimeric spectra in data sets with different isolation widths and gradient times. All spectra having 
two or more reliably identified precursors are chimeric spectra. As expected, the presence of chimeric spectra rises with 
increasing isolation width. 

Figure S8 Proportion of second search PSMs for spike-in data. For low spike-in amounts, the proportion of UPS peptides is 
higher in the second search, as these originate from rare proteins and are therefore more likely to be coeluting peptides. 

Figure S9 Score distributions of MS Amanda scores for target (blue) and decoy (red) peptides identified in the first (A) or 
second (B) search. The spectrum quality for co-eluting peptides is lower and the score distributions of target matches of the 
second search look very similar to decoy matches, so the effect of including auxiliary information used in Elutator is higher. 

Figure S10 RNA abundance of HeLa proteins. All HeLa proteins are depicted in red, all proteins identified in the first search 
in light green, all proteins identified in the second search in dark green and proteins solely identified in the second search are 
purple. 
Data set | Figure | Method | Ø PSMs first search | Ø PSMs second search | Ø PSMs added through validation | Ø Unique peptides first search | Ø Unique peptides overlap | Ø Unique peptides second search 
A | 3 | CharmeRT | 14506 | 10244 | - | 9445 | 3491 | 5360 
B | 3 | CharmeRT | 27340 | 20725 | - | 14716 | 8368 | 8262 
C | 3 | CharmeRT | 15918 | 17444 | - | 8308 | 5629 | 7715 
D | 3 | CharmeRT | 27234 | 32565 | - | 11131 | 11712 | 10295 
E | 3 | CharmeRT | 14219 | 25409 | - | 5064 | 7500 | 7874 
F | 3 | CharmeRT | 23138 | 44905 | - | 6266 | 13106 | 9178 
G | 4 | Mascot + Percolator | 4088 | - | 1371 | 2931 | - | - 
G | 4 | MaxQuant | 3525 | 398 | - | 2621 | - | 315 
G | 4 | pParse + Mascot + Percolator | 4379 | 109 | - | 3140 | - | 0 
G | 4 | MS Amanda + Percolator | [illegible] | - | 704 | 2679 | - | - 
G | 4 | MS Amanda + Elutator (no RT) | 4511 | 247 | - | 3032 | 152 | 66 
G | 4 | CharmeRT | 5128 | 335 | - | 3371 | 199 | 88 
H | 4 | Mascot + Percolator | 17916 | - | 4996 | 14201 | - | - 
H | 4 | MaxQuant | 15488 | 1284 | - | 12111 | - | 960 
H | 4 | pParse + Mascot + Percolator | 16752 | 7089 | - | 13441 | - | 2999 
H | 4 | MS Amanda + Percolator | [illegible] | - | 5313 | 14727 | - | - 
H | 4 | MS Amanda + Elutator (no RT) | 19720 | 15447 | - | 10330 | 5182 | 5948 
H | 4 | CharmeRT | 20199 | 18174 | - | 10107 | 5778 | 7062 
I | 2 | Mascot + Percolator | 21177 | - | 3047 | 17276 | - | - 
I | 2 | MaxQuant | 18973 | 1448 | - | 15095 | - | 1048 
I | 2 | pParse + Mascot + Percolator | 20568 | 6012 | - | 16813 | - | 1994 
I | 2 | MS Amanda + Percolator | [illegible] | - | 1782 | 17230 | - | - 
I | 2 | MS Amanda + Elutator (no RT) | 22313 | 9770 | - | 13526 | 4614 | 3327 
I | 2 | CharmeRT | 22796 | 11970 | - | 13191 | 5346 | 4232 


Table S3: Identified PSMs and unique peptides at 1% FDR (PSM or peptide level) for all Figures presented in the manuscript 


                          Total protein groups   RNA Expressed   Contaminants   No RNA Expression Data   Zero RNA Expression 
First search only         3741                   3550            24             85                       82 
First + Second searches   4696                   4435            30             118                      113 


Table S4: Mapping grouped proteins identified in first and second searches to RNA HeLa protein expression data. 




Figure S11 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide NANAVMEYEK are given in blue and ions of peptide SNcMDcLDR (with both 
cysteines carbamidomethylated) are given in red. 


NANAVMEYEK SNcMDcLDR 

matched ion m/z delta mass [ppm] matched ion m/z delta mass [ppm] 
[M+2H] 584.768 48.87 ImmL 86.096 6.39 
ImmY 136.076 0.44 y1 175.119 0.17 
y1 147.113 1.09 b2 202.082 0.45 
b2 186.087 1.34 y2 290.146 2.41 
y2 275.195 1.23 b3 362.113 3.07 
b3 300.130 2.00 y3 403.230 8.06 
y3 439.219 0.93 y4 563.261 1.70 
b4 371.167 3.23 y5 678.288 1.39 
y4 568.261 3.54 y6 809.328 2.29 
y5 699.302 1.97 y7 969.359 4.18 
y6 798.370 1.79 

b7 730.319 10.43 
y7 869.407 1.14 
y8 983.450 0.00 
Figure S12 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide CQAAEPQIITGSHDTTIR (with C being carbamidomethylated) are given in blue 
and ions of peptide QLVAEQVTYQR are given in red. 


CQAAEPQIITGSHDTTIR QLVAEQVTYQR 

matched ion m/z delta mass [ppm] matched ion m/z delta mass [ppm] 
y1 175.119 0.57 [M+2H] 667.857 16.40 
b2 289.097 1.80 y1 175.119 0.57 
y2 288.203 3.37 b2 242.150 1.36 
b3 360.134 1.86 y2 303.178 1.25 
y3 389.251 0.23 b3 341.218 3.25 
b4 431.171 10.09 y3 466.241 0.62 
y4 490.298 1.63 y4 567.289 1.53 
b5 560.213 7.93 y5 666.357 53.08 
y5 605.325 7.19 y6 794.416 0.35 
b6+ 329.137 24.82 y7 923.458 2.01 
y6 742.384 0.15 y8 994.495 1.21 
b7 785.325 14.91 y9 1093.564 1.26 
y7 829.416 10.38 y10 1206.648 10.26 
y8 886.438 1.74 

y8+ 443.723 0.88 

y9 987.485 0.35 

y9+ 494.246 7.30 

y10 1100.569 1.85 

y10+ 550.788 0.20 

y11 1213.654 5.69 

y12 1341.712 0.48 

b13+ 697.330 48.34 

y13 1438.765 1.01 

y13+ 719.886 1.18 
Figure S13 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide GTITVSAQELK are given in blue and ions of peptide VMEIVDADEK are given in 
red. 

GTITVSAQELK VMEIVDADEK 

matched ion m/z delta mass [ppm] matched ion m/z delta mass [ppm] 

ImmiL 86.096 6.27 Imml 86.096 6.27 
y1 147.113 0.95 y1 147.113 0.95 
b2 159.076 0.57 b2 231.116 2.60 
y2 260.197 0.42 y2 276.155 0.80 
b3 272.161 8.78 b3 360.159 5.14 
y3 389.239 1.70 y3 391.182 0.05 
b4 373.208 7.98 b4 473.243 3.61 
y4 517.298 0.41 y4 462.220 3.07 
y5 588.335 0.29 b5 572.311 12.34 
b6 559.309 64.06 y5 577.246 7.73 
y6 675.367 3.01 b6 687.338 1.82 
y7 774.436 2.20 y7 789.399 2.51 
y8 875.483 1.98 y8 918.442 2.23 
y9 988.567 4.54 y9 1049.482 4.60 
Figure S14 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide LQELPDAVPHGEMPR are given in blue, ions of peptide YGPLPGPAVPR are 
given in red, and ions of peptide ELTGEDVLVR are given in green. 


LQELPDAVPHGEMPR YGPLPGPAVPR ELTGEDVLVR 
matched m/z delta matched m/z delta matched m/z delta 
ion mass ion mass ion mass 
[ppm] [ppm] [ppm] 
y1 175.119 0.17 [M+2H] 562.316 11.58 y1 175.119 0.17 
b2 242.150 1.57 ImmY 136.076 0.37 b2 243.134 6.91 
y2 272.172 0.66 y1 175.119 0.17 y2 274.187 0.15 
b3 371.193 0.24 b2 221.092 6.06 b3 344.182 35.59 
y3 403.212 49.75 y2 272.172 0.66 y3 387.271 0.49 
b4 484.277 86.36 b3 318.145 0.97 b4 401.203 4.89 
y4 532.255 3.70 y3 371.240 15.17 y4 486.340 0.41 
y5 589.276 1.22 b4 431.229 0.60 y5 601.367 0.23 
y6 726.335 0.73 y5 539.330 1.61 y6 730.409 2.12 
y7 823.388 2.25 y6 596.351 0.99 y7 787.431 8.90 
y8 922.456 2.07 y7 693.404 1.62 y8 888.479 1.56 
y9 993.494 17.11 y8 806.488 2.00 
y10 1108.520 3.39 y9 903.541 0.76 
y11 1205.573 2.84 y10 960.563 17.09 
Figure S15 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide AISHEHSPSDLEAHFVPLVK are given in blue, ions of peptide NDLSPTTVMSEGAR 
are given in red, and ions of peptide TPAFAESVTEGDVR are given in green. 


AISHEHSPSDLEAHFVPLVK NDLSPTTVMSEGAR TPAFAESVTEGDVR 
matched m/z delta matched m/z delta matched m/z delta 
ion mass ion mass ion mass 

[ppm] [ppm] [ppm] 

[M+3H] 738.383 16.92 [M+2H] 739.352 15.70 [M+2H] 739.854 16.37 

b2 185.128 2.70 y1 175.119 0.06 y1 175.119 0.06 

y2 246.181 1.18 b2 230.077 0.26 b2 199.108 0.95 

b3 272.161 25.10 y2 246.156 13.69 y2 274.187 2.52 

y3 359.265 1.39 b3 343.161 1.08 b3 270.145 0.44 

b4 409.219 0.73 y3 303.178 1.65 y3 389.214 9.71 

y4 456.318 2.59 b4 430.193 1.12 b5 488.250 34.57 

b5 538.262 1.97 y4 432.220 6.13 b6 617.293 7.27 

y5 555.386 5.64 y5 519.252 6.89 y6 676.326 8.32 

b6 675.321 1.05 y6 650.293 2.34 y7 775.394 10.38 

y6 702.455 0.23 y7 749.361 3.23 y8 862.427 5.61 

b7 762.353 2.53 y8 850.409 0.36 y9 991.469 11.50 

y7 839.514 5.86 y9 951.456 5.75 y10 1062.506 18.51 

y7+ 420.261 0.14 y10 1048.509 0.44 y11 1209.575 4.79 

y8 910.551 9.70 b11 1175.525 16.32 y12 1280.612 8.43 

b10 1061.465 0.42 y11 1135.541 0.74 b13 1304.600 0.58 

b10+ 531.236 0.08 y12 1248.625 5.18 

b11 1174.549 3.58 b13 1303.583 0.66 

b11+ 587.778 9.41 

b12 1303.591 5.48 

b12+ 652.299 0.69 

b13 1374.629 1.18 

b13+ 687.818 5.73 

y13 1451.789 4.42 

b14+ 756.347 0.38 

b15+ 829.882 5.45 

b16+ 879.416 0.10 

b18+ 984.484 11.78 
Figure S16 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide WVGGQHPCFIIAEIGQNHQGDLDVAK (with C being carbamidomethylated) are 
given in blue and ions of peptide VNLLSFTGSTQVGK are given in red. 


WVGGQHPCFIIAEIGQNHQGDLDVAK VNLLSFTGSTQVGK 

matched ion m/z delta mass [ppm] matched ion m/z delta mass [ppm] 
b2 286.155 0.63 b2 214.119 0.05 
y2 218.150 0.50 y2 204.134 1.22 
b3 343.177 31.44 b3 327.203 0.18 
y3 317.218 0.06 y3 303.203 4.22 
y4 432.245 5.92 b4 440.287 1.91 
y5 545.329 4.57 y4 431.261 0.93 
b6 665.315 3.71 y5 532.309 6.42 
b7 762.368 9.88 y6 619.341 14.32 
y7 717.378 1.60 y7 676.362 5.46 
y8 845.436 4.96 b8 832.456 58.05 
b9 1069.467 1.07 y8 777.410 2.97 
y9 982.495 2.59 y9 924.479 0.12 
b10 1182.551 7.66 y10 1011.511 2.55 
y10 1096.538 2.31 y11 1124.595 0.74 
b11+ 648.321 56.56 y12 1237.679 1.52 
y11 1224.597 1.56 

y12 1281.618 0.41 

y12+ 641.313 2.96 

y13 1394.702 2.52 

y13+ 697.855 0.50 

y14 1523.745 0.32 

y14+ 762.376 0.62 

y15 1594.782 5.22 

y15+ 797.895 1.20 

b16+ 897.443 41.53 

y16 1707.866 1.48 

y16+ 854.437 2.25 

y18+ 984.513 5.47 

y19+ 1064.528 27.45 

y20+ 1113.055 4.11 
Figure S17 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide YLEVVLNTLQQASQAQVDK are given in blue and ions of peptide 
GIDVQQVSLVINYDLPTNR are given in red. 


YLEVVLNTLQQASQAQVDK GIDVQQVSLVINYDLPTNR 
matched ion m/z delta mass [ppm] matched ion m/z delta mass [ppm] 

b2 277.155 0.47 y1 175.119 0.06 
y2 262.140 0.42 b2 171.113 1.58 
b3 406.197 0.00 y2 289.162 6.22 
y3 361.208 2.60 b3 286.140 1.29 
b4 505.266 0.10 b4 385.208 1.38 
y4 489.267 4.50 y4 487.262 1.89 
b5 604.334 6.47 b5 513.267 7.05 
y5 560.304 3.77 y5 600.346 2.20 
b6 717.418 11.76 b6 641.325 5.40 
y6 688.362 4.74 y6 715.373 1.47 
b7 831.461 13.11 y7 878.437 13.98 
y7 775.395 0.27 y8 992.480 5.34 
b8 932.509 7.75 y9 1105.564 3.84 
y8 846.432 0.26 y10 1204.632 10.48 
y9 974.490 25.20 y11 1317.716 17.92 
y10 1102.549 1.60 y12 1404.748 2.99 
b11 1301.710 28.26 y13 1503.816 12.93 
y11 1215.633 0.26 b14 1544.796 10.60 
b12 1372.747 36.05 y14 1631.875 0.00 
y12 1316.681 0.39 

y13 1430.724 0.51 

y14 1543.808 0.86 

y15 1642.876 2.73 

y16 1741.944 2.58 

y17 1870.987 2.45 
Figure S18 Chimeric spectrum example. Spectrum is part of data set D (HeLa tryptic digest, Q Exactive Hybrid, 3h gradient, 
4m/z isolation width). Matched ions of peptide GVDEVTIVNILTNR are given in blue and ions of peptide VVIGMDVAASEFFR are 
given in red. 

GVDEVTIVNILTNR VVIGMDVAASEFFR 

matched ion m/z delta mass [ppm] matched ion m/z delta mass [ppm] 
y1 175.119 0.17 ImmF 120.081 3.08 
b2 157.097 0.95 y1 175.119 0.17 
y2 289.162 0.24 b2 199.144 0.90 
b3 272.124 3.42 y2 322.187 0.99 
y3 390.210 2.18 b3 312.228 1.02 
b4 401.167 0.72 y3 469.256 0.13 
y4 503.294 1.35 b4 369.250 1.33 
b5 500.235 6.08 y4 598.298 5.03 
y5 616.378 4.10 y5 685.330 3.05 
b6 601.283 2.79 b6 615.317 4.73 
y6 730.421 0.29 y6 756.368 1.63 
b7 714.367 15.58 b7 714.385 10.46 
y7 829.489 3.70 y7 827.405 1.62 
y8 942.573 1.10 y8 926.473 3.21 
b9 927.478 2.64 y9 1041.500 0.82 
y9 1043.621 2.78 y10 1172.541 1.58 
y10 1142.689 0.57 y11 1229.552 0.78 
y11 1271.732 0.05 y12 1342.646 0.45 
y12 1386.759 7.77 


Chapter 5 


Related Work 


5.1 A Symbolic Regression Based Scoring System 
Improving Peptide Identification for MS Amanda 


This section covers the comparison of validating peptide spectrum matches 
using genetic programming and random forests [8], which was 
published by ACM Press in 2015. Both white box and black box models 
are used to distinguish between false and correct identifications and are 
compared to each other [22]. 

Reprinted with permission from Dorfer, V.; Maltsev, S.; Dreiseitl, S.; 
Mechtler, K.; Winkler, S. A Symbolic Regression Based Scoring System 
Improving Peptide Identifications for MS Amanda. Proceedings of the 
Companion Publication of the 2015 Annual Conference on Genetic and 
Evolutionary Computation, pages 1335-1341. 

https://dl.acm.org/citation.cfm?doid=2739482.2768509. Copyright 2015 
ACM New York. 
ABSTRACT 


Peptide search engines are algorithms that are able to iden- 
tify peptides (i.e., short proteins or parts of proteins) from 
mass spectra of biological samples. These identification al- 
gorithms report the best matching peptide for a given spec- 
trum and a score that represents the quality of the match; 
usually, the higher this score, the higher is the reliability of 
the respective match. In order to estimate the specificity 
and sensitivity of search engines, sets of target sequences 
are given to the identification algorithm as well as so-called 
decoy sequences that are randomly created or scrambled ver- 
sions of real sequences; decoy sequences should be assigned 
low scores whereas target sequences should be assigned high 
scores. 

In this paper we present an approach based on symbolic 
regression (using genetic programming) that helps to dis- 
tinguish between target and decoy matches. On the ba- 
sis of features calculated for matched sequences and using 
the information on the original sequence set (target or de- 
coy) we learn mathematical models that calculate updated 
scores. As an alternative to this white box modeling approach 
we also use a black box modeling method, namely 
random forests. 

As we show in the empirical section of this paper, this 
approach leads to scores that increase the number of reliably 
identified samples that are originally scored using the MS 
Amanda identification algorithm for high resolution as well 
as for low resolution mass spectra. 


Keywords 


Proteomics; peptide identification; symbolic regression 


Categories and Subject Descriptors 


H.2.8 [Database Applications]: Data mining; I.2.8 
[Artificial Intelligence]: Heuristic methods; J.3 [Life and 
Medical Sciences]: Biology and genetics 


1. INTRODUCTION 


Mass spectrometry based proteomics has emerged as a 
powerful and widely used technique in the analysis of biological 
samples [2]. The resulting so-called tandem mass spectra 
contain peaks as mass-to-charge ratios and respective 
ion intensities of peptide fragments. Peptide search en- 
gines are used to identify peptides (i.e., short proteins or 
parts of proteins) from those mass spectra. These identi- 
fication algorithms report the best matching peptide for a 
given spectrum and a score that represents the quality of the 
match. A score dependent on an identification algorithm is 
assigned to each peptide spectrum match (PSM); usually, 
the higher this score, the higher is the reliability of the re- 
spective match. There are several scoring algorithms that 
are frequently used in modern proteomics incorporating var- 
ious strategies to evaluate the quality of a PSM, e.g., Mascot 


[14], SEQUEST [8], Andromeda [5], and, most recently, MS 
Amanda [6]. 

In order to estimate the specificity and sensitivity of 
search engines, sets of target sequences are given to the 
identification algorithm as well as so-called decoy sequences 
that are randomly created or scrambled versions of real sequences. 
As no gold standard data are available for proteomics 
experiments, these target-decoy searches are used 
to estimate false identifications among matches to the target 
database [13, 7]. In practice, a threshold θ is defined and 
only PSMs with a score higher than this threshold are accepted. 
θ is set to the value that leaves only a desired 
number of decoy matches above the threshold. Applying 
this false discovery rate (FDR), the number of false identifications 
can be estimated as being equal to the number of 
decoy hits; the FDR is usually set to values such as 1%. 
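
The threshold selection described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular search engine: the function name `score_threshold` is hypothetical, the FDR is estimated simply as the number of decoys divided by the number of targets among accepted PSMs, and a PSM is accepted when its score is at least the returned cutoff.

```python
from typing import Iterable, Optional, Tuple


def score_threshold(psms: Iterable[Tuple[float, bool]],
                    max_fdr: float = 0.01) -> Optional[float]:
    """Lowest target score down to which decoys/targets stays <= max_fdr.

    `psms` holds (score, is_decoy) pairs. Returns None if no cutoff
    satisfies the requested FDR.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    theta = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        if decoys / targets <= max_fdr:
            theta = score  # all PSMs scoring >= theta meet the FDR limit
    return theta


# Hypothetical toy data: three target hits and one decoy hit.
psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
print(score_threshold(psms, max_fdr=0.1))
print(score_threshold(psms, max_fdr=0.4))
```

With the strict limit only the two top-scoring targets are kept; relaxing it admits the third target because one decoy among three targets is then tolerated.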

Appropriate peptide identification algorithms should as- 
sign low scores to false and decoy sequences whereas target 
sequences should be assigned high scores. Obviously, these 
approaches do not always work perfectly: there will always 
be true PSMs that are scored below θ. To improve the 
discrimination between correct and wrong identifications we 
here present a machine learning approach for target-decoy 
classification. On the basis of features that are calculated for 
matched sequences and using the information about previ- 
ously analyzed samples on the original sequence set (target 
or decoy) we learn mathematical models that calculate up- 
dated scores. 

This approach is inspired by Percolator [9], a semi- 
supervised learning method for peptide identification from 
shotgun proteomics datasets. Percolator uses support vector 
machines to learn models that discriminate between positive 
and negative PSMs. Instead of support vector machines, we 
want to focus on white box modeling, namely symbolic re- 
gression by genetic programming for training such discrim- 
inators. White box models may further be used to improve 
score calculation of peptide identification algorithms.

In Section 2 we define the algorithmic details of the ap- 
proach pursued to reach these goals. As we show in the 
empirical section (Section 3) of this paper, this approach 
leads to scores that increase the number of reliably identi- 
fied samples that are originally scored using the MS Amanda 
identification algorithm. In Section 4 we discuss these results 
and give an outlook to further research in this area. 


2. ALGORITHMS 


2.1 Overall Workflow


The overall workflow of the algorithm described in this
paper is shown in Figure 1. In an initial training phase we 
collect information about analyzed PSMs and train models 
that shall assign improved scores to PSMs; later, these mod- 
els are used to calculate updated scores that shall help to 
distinguish more clearly between target and decoy peptide 
spectrum matches. 


2.1.1 Training Phase 


• First, in the training phase, standard peptide spec-
trum matching results are collected. For each given
spectrum we get PSMs plus respective scores. Addi-
tionally, we also know which PSMs are decoy hits and
which target hits score above the 1% FDR threshold
and are therefore considered true hits.


• From this information we calculate a new score for
each PSM psm:


– If psm is a decoy hit, then it is assigned 0:

\[ \mathrm{is\_decoy}(psm) \Rightarrow score_{new}(psm) = 0 \tag{1} \]


– Otherwise, if psm is a true hit, i.e., a target hit
scoring above the threshold θ, then it is assigned
the original score:

\[ \neg\,\mathrm{is\_decoy}(psm) \wedge score(psm) > \theta \Rightarrow score_{new}(psm) = score(psm) \tag{2} \]


– All other PSMs matching the target database are
not used for training as it is doubtful whether
those are true or false hits.


• These new scores are then used in combination with
further information on the PSMs, especially on peptide
sequences, for training models that assign estimates for
the new score to new, unseen PSMs. For all PSMs the
following features are calculated:


– Scores calculated by the peptide identification al-
gorithm.


– Mass spectrum specific features such as the mass
to charge ratio and the charge state of the spec-
trum.


– Peptide specific features such as the score differ-
ence to the second best matching peptide¹, the
peptide length, or the number of missed cleav-
ages.
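

The training phase described above can be sketched as follows. This is only an illustration under stated assumptions: the `Psm` class, its field names, and the helper functions are hypothetical and merely mirror the features listed in the text; `theta` stands for the score threshold at the chosen FDR.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Psm:
    """Hypothetical container for one peptide spectrum match."""
    score: float              # score from the identification algorithm
    is_decoy: bool            # True if matched against the decoy database
    mz: float                 # mass to charge ratio of the spectrum
    charge: int               # charge state of the spectrum
    delta_to_second: float    # score difference to the second best peptide
    peptide_length: int
    missed_cleavages: int


def training_target(psm: Psm, theta: float) -> Optional[float]:
    """New score for training: 0 for decoys, original score for true hits."""
    if psm.is_decoy:
        return 0.0
    if psm.score > theta:
        return psm.score
    return None  # doubtful target hit, not used for training


def build_training_set(psms: List[Psm], theta: float) -> List[Tuple[List[float], float]]:
    """Return (feature vector, target value) pairs for usable PSMs."""
    rows = []
    for p in psms:
        y = training_target(p, theta)
        if y is None:
            continue
        features = [p.score, p.mz, float(p.charge), p.delta_to_second,
                    float(p.peptide_length), float(p.missed_cleavages)]
        rows.append((features, y))
    return rows
```

Target hits scoring at or below `theta` are dropped rather than labeled, which matches the statement that their status is doubtful.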


2.1.2 Application Phase 


• In the later application phase, new spectra are pre-
sented and, using a peptide search engine, PSMs are
calculated. Additionally, using the previously gener-
ated mathematical models, an updated score is also
calculated for each newly presented PSM.


As usual, a threshold θ is set such that only a cer-
tain ratio of decoy hits remains among the considered
PSMs, which estimates the number of false identifications
among target hits. We can then calculate how many target
PSMs can now be confidently identified using either
the standard workflow or the new workflow with up-
dated scores presented here.


¹ Usually, only the best matching peptide is considered for
a spectrum; still, the distance to the second best matching
peptide is of high importance.


Figure 1: Overview of enhanced peptide identification using MS Amanda and machine learning.


2.2 Methods


In this section we describe the methods we use for identi- 
fying peptides and for calculating models that estimate new 
scoring values: 

• As peptide identification algorithm we use MS
Amanda.


• For machine learning we use genetic programming as
well as random forests.


2.2.1 MS Amanda 


To identify peptides out of mass spectra we used the 
database search algorithm MS Amanda [6]. MS Amanda is 
a scoring approach especially designed for mass spectra with 
high mass accuracy and outperforms gold standard peptide 
identification algorithms Mascot and SEQUEST at the same 
false discovery rate. This scoring algorithm is freely avail- 
able at http://ms.imp.ac.at/?goto=msamanda as a platform 
independent standalone application as well as integrated in
Proteome Discoverer, SearchGUI [18], and PeptideShaker
[19].


2.2.2 Genetic Programming 


For symbolic regression we use genetic programming (GP) 
[12] with strict offspring selection (OS) as described in [21] 
and [1]. The function set described in [21] (including arith-
metic as well as logical ones) was used for building composite
function expressions. 


Applying offspring selection has the effect that new indi- 
viduals are compared to their parents; in the strict version, 
children are passed on to the next generation only if their 
quality is better than the quality of both parents. Figure 2 
shows our GP implementation with OS, Figure 3 schemati- 
cally shows OS (standard as well as strict). 
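
The strict acceptance rule can be sketched in a few lines. This is a simplified illustration, not the HeuristicLab implementation: the function name, the callback signatures, and the definition of selection pressure as created children divided by population size are assumptions made for this sketch.

```python
import random


def strict_offspring_selection(population, quality, make_child, rng,
                               max_selection_pressure=100):
    """Build one new generation using strict offspring selection.

    population: list of candidate solutions
    quality: function mapping a solution to a number (higher is better)
    make_child: function (parent_a, parent_b, rng) -> new solution
    Returns the new generation, or None if the selection pressure limit
    is reached before enough successful children were found.
    """
    n = len(population)
    next_generation = []
    created = 0
    while len(next_generation) < n:
        if created >= max_selection_pressure * n:
            return None  # termination criterion reached
        parent_a = rng.choice(population)
        parent_b = rng.choice(population)
        child = make_child(parent_a, parent_b, rng)
        created += 1
        # Strict rule: the child must be better than BOTH parents.
        if quality(child) > max(quality(parent_a), quality(parent_b)):
            next_generation.append(child)
    return next_generation
```

Returning `None` once the limit is hit mirrors the use of a maximum selection pressure as a stopping condition: when hardly any child beats both parents, the search is considered converged.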

In addition to splitting the given data into training and
test data, we apply GP in such a way that a part of the given
training data is not used for training models and serves as
validation set; in the end, when it comes to returning the
eventual results, the algorithm returns those models that
perform best on validation data. This approach has been
chosen because it helps to cope with over-fitting; it is also
applied in other GP based machine learning algorithms as
for example described in [3].


[Figure 2 boxes: Population of Models; Parents Selection; Generation of New Models (by Crossover, Mutation, ...); Test (Evaluation) of Models; Offspring Selection]

Figure 2: Genetic programming with offspring se-
lection [21].


Figure 3: Offspring selection [1].

We use GP as implemented in HeuristicLab [20, 11]
(http://dev.heuristiclab.com), a framework for prototyping and
analyzing optimization techniques for which both generic 
concepts of evolutionary algorithms and many functions to 
evaluate and analyze them are available. Figure 4 shows GP 
solving a regression problem in HeuristicLab 3.3.11. 


2.2.3 Random Forest Classification


Random forests (RFs, [4]) are ensembles of decision trees, 
each depending on randomly chosen samples and features. 
The best known algorithm for inducing random forests was 
first described in [4] combining bagging and random feature 
selection: 


• For each tree in the forest, a certain number of input
variables is used to determine the decision at a node
of the tree.


• A certain number of samples is randomly drawn from
the training data base; the rest of the samples is used
as internal validation set for estimating the model's
prediction error (out-of-bag error).


Figure 4: Solving a regression problem with sym-
bolic regression in HeuristicLab 3.3.11.


When it comes to calculating the value predicted for a 
given sample, this sample is pushed down the trees and is 
assigned the label (predicted value) of the terminal node 
it eventually ends up in. This procedure is executed for
all trees in the forest and the final prediction for the given
sample is the majority vote (mode) of all trees.
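
The sampling and voting steps can be sketched as follows. This is a small illustration, not the random forest implementation used for the experiments: the function names are hypothetical, and drawing the samples without replacement is an assumption made to keep the in-bag and out-of-bag sets disjoint.

```python
import random
from collections import Counter


def split_in_bag(n_samples, ratio, rng):
    """Draw a share of sample indices for one tree; the rest is out-of-bag.

    Assumption: indices are drawn without replacement.
    """
    k = int(round(ratio * n_samples))
    in_bag = rng.sample(range(n_samples), k)
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def pick_features(n_features, ratio, rng):
    """Choose the random subset of input variables available to one tree."""
    k = max(1, int(round(ratio * n_features)))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Return the most frequent label predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

The out-of-bag indices of each tree are the samples on which that tree can be evaluated without having seen them, which is what makes the internal error estimate possible.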

RFs are a very popular machine learning method as they 
are known to be one of the most accurate learning algo- 
rithms available [15], robust against overfitting, and widely 
considered a very efficient machine learning method. 

Figure 5 schematically shows the aggregation of estimated 
target values produced by a set of trees as implemented for 
random forests. 


Figure 5: Random forest regression (adapted from
[16]).


3. EMPIRICAL TESTS


3.1 Sample Preparation and Data


To test our approach we used two mass spectrometry data 
sets from a human cancer cell line: 


• The first data set DS1 (1 µg) was measured on a
Thermo Fisher QExactive mass spectrometer and ac-
quired along a 3 h gradient (high resolution data set for
MS2 spectra),


• the second data set DS2 (1 µg, 1 h) was acquired on
a Thermo LTQ-Orbitrap Velos and first reported in
Koecher et al. [10].


Resulting spectra were analyzed in Proteome Discoverer
(version 1.4.0.288) using MS Amanda. Mass spectra were
matched to the UniProt human protein database [17] includ-
ing isoforms, extended by common contaminants and
reverted protein sequences serving as decoy proteins.
Database search was conducted using trypsin as digestion
enzyme and allowing up to 2 missed cleavages. For the high
resolution data set (DS1) 15 ppm and 0.02 Da were used as 
precursor mass and as fragment mass tolerance, respectively, 
while we used 10 ppm and 0.5 Da for the low resolution 
data set (DS2). Carbamidomethylation of cysteine and oxi- 
dation of methionine were set as fixed and variable peptide 
modifications, respectively. For each spectrum MS Amanda 
reported up to 5 best matching peptides, with an Amanda 
score ranging between 0.4 and 662 (662 representing a top 
match). 

Each data set obtained in this way was split into one set of
PSMs that are used for training models and one for testing our
combined approach:


• For DS1, 30,000 samples (PSMs) are used for training
and model selection, the remaining 155,271 samples
(PSMs) are here used as test samples.


• For DS2, 2,000 samples (PSMs) are used for train-
ing and model selection, the remaining 35,163 samples
(PSMs) are here used as test samples.


3.2 Test Results 


Both machine learning methods applied here, GP and 
RFs, were executed with varying parameter settings: 


• For GP, different model size constraints and population
sizes were applied: The allowed model depth was var-
ied between 6 and 10, the allowed model complexity
was varied between 50 and 200, and the mutation rate
was varied between 10% and 30%. A combination of
random and roulette parent selection was applied as
well as offspring selection. Offspring selection was kept
strict for all executions, i.e., in each generation only those
models were propagated to the next generation that
performed better than both parents. The maximum
selection pressure was set to 100, and this was used
as termination criterion. As fitness function we used
the squared Pearson correlation coefficient (R²).


• For RFs, different values for the parameters M (the
ratio of features used for creating the trees), R (the
ratio of samples used for training the trees), and the
number of trees were tested: M was varied between 0.3
and 0.7, R was also varied between 0.3 and 0.7, and
the number of trees was varied between 50 and 200.

For both methods, the 5 models with best performance on
training (in the case of GP: validation) data were selected;
the quality of a model is calculated as the correlation (R²)
of estimated and original scores. The test results given in
the following are calculated as the average performance of
the models selected in this way, where performance is measured
as the number of PSMs identified on test data not seen during training.
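
The model quality measure and the selection of the best models can be sketched as follows. This is a plain-Python illustration with hypothetical function names; it assumes that each model is represented only by the list of scores it estimates for the same PSMs as the original scores.

```python
def pearson_r2(estimated, original):
    """Squared Pearson correlation of two equally long number sequences."""
    n = len(estimated)
    mean_x = sum(estimated) / n
    mean_y = sum(original) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estimated, original))
    sxx = sum((x - mean_x) ** 2 for x in estimated)
    syy = sum((y - mean_y) ** 2 for y in original)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant sequence carries no correlation information
    return (sxy * sxy) / (sxx * syy)


def select_best(estimates_by_model, original, k=5):
    """Return the names of the k models with the highest R² value."""
    ranked = sorted(estimates_by_model.items(),
                    key=lambda item: pearson_r2(item[1], original),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

Since R² is invariant to scaling and shifting of the estimates, a model is rewarded for ranking and spacing PSMs like the original score, not for reproducing its absolute values.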

We here analyze test results (i.e., reported PSMs) for dif- 
ferent false discovery rates (FDR) as well as varying size 
limits for the peptides. Tables 1 and 2 summarize the re- 
sults where column “ml” gives the minimum peptide length, 
“A” the results achieved using MS Amanda, “A+GP” the 
results achieved using the combination of MS Amanda and 
GP, and “A+RF” the combination of MS Amanda and RFs. 
Figures 6 and 7 show these results graphically.


Table 1: Test results achieved for data set DS1.


FDR    ml   A       A+GP              A+RF
0.1%   6     7586    8245 (+8.7%)      9326 (+22.9%)
       7     7610    9218 (+21.1%)     9319 (+22.5%)
             8566    9311 (+8.7%)     10615 (+23.9%)
0.5%        10350   12452 (+20.3%)    12956 (+25.2%)
            10857   12953 (+19.3%)    13246 (+22.0%)
            11094   13144 (+18.5%)    12641 (+13.9%)
1%          11976   14451 (+20.7%)    13837 (+15.5%)
            12154   14407 (+18.5%)    13912 (+14.5%)
            11892   13493 (+13.5%)    13390 (+12.6%)


Table 2: Test results achieved for data set DS2.


Figure 6: PSMs identified for the first data set (DS1)
using standard MS Amanda (A, blue) compared to
results obtained using new scores calculated with
models generated by symbolic regression (A+GP,
red) and random forests (A+RF, green).


Figure 7: PSMs identified for the second data set
(DS2) using standard MS Amanda (A, blue) com-
pared to results obtained using new scores calcu-
lated with models generated by symbolic regression
(A+GP, red) and random forests (A+RF, green).


3.3 Test Results Discussion


The results summarized in Tables 1 and 2 show that in all 
cases, i.e., for both data sets and for all choices of minimum 
peptide length and false discovery rate, the scores calculated 
by models identified by nonlinear modeling lead to better 
peptide identification rates. 


• For the high resolution data set (DS1) we see that the
number of identified PSMs can be increased by 10% to
20%:


– The highest relative increase (up to +25%) can
be seen for 0.5% FDR; for 0.1% FDR and 1% FDR
the number of PSMs can also be increased by
20% or more.


– For 0.1% FDR, RFs show a better performance
than models identified using GP,


– whereas models learned using GP perform better
when setting the false discovery rate to 1%.


• For the low resolution data set (DS2) we see that the
performance increase is not as high as for the high
resolution data, but still the numbers of identified
PSMs can be increased significantly. We here see that
the models identified by genetic programming perform
better than RFs:


– For 0.1% FDR, using scores calculated by mod-
els identified by GP the performance can be in-
creased by up to 14%, whereas using RFs in-
creases the performance by up to 4%.


– For 0.5% FDR, both machine learning approaches
tested here lead to performance increases of 7% to
10%.


– For 1% FDR, both machine learning approaches
tested here lead to performance increases of 4% to
8%, where results achieved using GP are slightly
better than those achieved using RFs.


4. CONCLUSION 


We have tested various machine learning approaches for 
calculating new scores for peptide spectrum matches of high 
accuracy mass spectra. Results show that not only black 
box modeling (using RFs or SVMs, as used in Percolator), 
but also white box modeling (using symbolic regression) is 
perfectly well suited for improving the separation of correct 
and false peptide identifications of mass spectra. White box 
approaches generate models that can be analyzed regarding 
their structure and variable impacts, and they can also be 
compared for different data sets as those models are trans- 
parent. Components of these models can provide further 
insight into characteristics of target versus decoy identifica- 
tions which may additionally be integrated in peptide iden- 
tification algorithms in advanced scoring models. Future 
plans include detailed analysis of the generated models to 
extract significant differentiation properties that shall fur- 
ther be integrated in the peptide identification algorithm 
MS Amanda. The comparison of models generated for dif-
ferent data sets will help us to gain further insight into peptide
identification.


5.2 Expanding the use of spectral libraries in proteomics


ABSTRACT: The 2017 Dagstuhl Seminar on Computational
Proteomics provided an opportunity for a broad discussion on 
the current state and future directions of the generation and 
use of peptide tandem mass spectrometry spectral libraries. 
Their use in proteomics is growing slowly, but there are 
multiple challenges in the field that must be addressed to 
further increase the adoption of spectral libraries and related 
techniques. The primary bottlenecks are the paucity of high 
quality and comprehensive libraries and the general difficulty 
of adopting spectral library searching into existing workflows. 
There are several existing spectral library formats, but none 
captures a satisfactory level of metadata; therefore, a logical 
next improvement is to design a more advanced, Proteomics
Standards Initiative-approved spectral library format that can encode all of the desired metadata. The group discussed a series of
metadata requirements organized into three designations of completeness or quality, tentatively dubbed bronze, silver, and gold. 
The metadata can be organized at four different levels of granularity: at the collection (library) level, at the individual 
entry (peptide ion) level, at the peak (fragment ion) level, and at the peak annotation level. Strategies for encoding mass 
modifications in a consistent manner and the requirement for encoding high-quality and commonly seen but as-yet-unidentified 
spectra were discussed. The group also discussed related topics, including strategies for comparing two spectra, techniques for 
generating representative spectra for a library, approaches for selection of optimal signature ions for targeted workflows, and 
issues surrounding the merging of two or more libraries into one. We present here a review of this field and the challenges that 
the community must address in order to accelerate the adoption of spectral libraries in routine analysis of proteomics datasets. 


KEYWORDS: mass spectrometry, spectral libraries, standards, formats, Dagstuhl Seminar, meeting report,


■ INTRODUCTION


Mass spectrometry (MS)-based proteomics has enabled the 
high-throughput identification of proteins present in biological 
samples and the measurement of their abundances, post- 
translational modifications, sequence and splice variants, and 
interaction partners. Although sample preparation techniques 
and instrumental setups remain complex and vary greatly, an 
increasing number of laboratories are applying MS techniques 
to better understand health and disease and to address basic 
biological questions. In typical MS-based proteomics experi- 
ments, proteins are extracted from samples and enzyme-digested 
into peptides, which are separated by chromatography and 
ionized. The mass spectrometer produces digital signatures of 
these ions at the precursor and fragment ion level. Modern 
instruments can record the signatures of hundreds of thousands 
of peptidoforms per experiment. 

The translation of these signatures into the desired informa- 
tion about their respective peptides and proteins is crucial for 
further interpretation. There are many software packages that 
have been developed over the past 25 years to perform the 
computational analyses needed to perform this task. For data- 
dependent acquisition (DDA) workflows, where instruments 
automatically select which ions to analyze based on simple rules, 
the most common analysis technique is sequence database 
searching. This involves matching observed fragmentation
mass spectra to simple simulations of spectra corresponding to
peptides that may be present in the sample and selecting the best
match for further validation. Once sufficiently confident iden-
tifications are made, those peptide-spectrum matches (PSMs)
can be stored in a library of previously identified spectra 
(a spectral library), which could be used for subsequent analyses 
of other data. 

Spectral library searching, as opposed to sequence searching 
via in silico-predicted fragmentation spectra, typically has greater 
sensitivity for peptide ions included in the library. Spectral
library-based analyses would, therefore, seem like the method of
choice for analysis of new datasets, but relatively few DDA
datasets are analyzed in this way. A major reason for this is the
widespread concern that current libraries are incomplete. Peptide
ions for which no corresponding spectrum exists in the reference
library will not be identified, and thus some potentially important
peptides may be missed. Data-independent acquisition (DIA)
workflows have recently undergone rapid growth due to faster
and higher mass accuracy instrumentation, affording acquisition
methods such as SONAR, SWATH-MS, and MSX. In these
techniques, highly multiplexed fragmentation spectra are acquired
according to predefined data acquisition patterns, independent
of observations within the run, and the analyses of these data
have spurred new interest in spectral libraries. Although library-
free methods are emerging, the most commonly used ana-
lysis techniques for LC-MS DIA data rely on spectral libraries to
analyze extracted ion chromatograms to test for the presence of
and quantify the abundance of peptide ions in the reference
library. Other targeted workflows, such as selected or parallel
reaction monitoring (SRM/PRM), increasingly rely on large-scale
spectral libraries to determine which proteotypic peptides
and fragment ions to monitor.

With billions of fragment ion spectra acquired by the research 
community to date, we argue that it should be possible to 
leverage these big data for the processing of all new data acquired. 
However, the current state of spectral libraries, the software that 
generates them, and software that can use them lag far behind 
the availability of data. Data from public repositories, such as 
PeptideAtlas, PRIDE, MassIVE, GPMdb, and ProteomicsDB,
and tools are available; the major hindrances are
familiarizing researchers with software tools, rendering those
tools user-friendly, and promoting the use of spectral-based
methods so that they become the norm rather than the exception.

At the 2017 Dagstuhl Seminar on Computational Proteomics 
(Seminar 17421), hosted October 16–20, 2017, at Schloss
Dagstuhl in Wadern, Germany, a group of participating
researchers (hereafter referred to as "the group") discussed the
current state and future directions of spectral libraries in the field
of proteomics. A follow-up meeting at the 2018 Proteomics
Standards Initiative (PSI) Spring Workshop in Heidelberg,
Germany (April 18–20, 2018) provided an opportunity for
further discussion and resulted in a draft of metadata that should 
be encodable in an eventual PSI spectral library format. In this 
article, the major topics of discussion and some resulting 
conclusions are presented, with a special focus on what actions 
can be taken in the near term to advance the field. The benefits 
and requirements for a new PSI standard spectral library format 
are discussed, along with the issues surrounding the develop- 
ment of single-source and community-source spectral libraries 
and the state of major applications of libraries. The article 
concludes with a summary of the future opportunities that were 
discussed by the group. 


■ A NEW PSI FORMAT


There are several formats for spectral libraries used in 
proteomics applications. The oldest and most widely used is 
the simple, text-based MSP format from the National Institute of 
Standards and Technology (NIST). Highly similar to this is the 
SpectraST splib format, which is essentially a binary indexed
version of MSP. SpectraST also writes a companion sptxt format,
which is the same as MSP. The Global Proteome Machine
(GPM) releases libraries in its hlf format for use with its
X! Hunter tool. The BiblioSpec tool began with the original
text-based blib format and later moved to a SQLite-based
implementation in the blib2 format. The Center for Computa-
tional Mass Spectrometry (CCMS) suite of spectral library
searching tools and the MassIVE-KB spectral libraries use
an extended version of the MGF format originally proposed by
Matrix Science. Each of these formats continues to be used, but
there is a widespread opinion that none provides the richness of 
metadata that ought to be available in modern spectral libraries. 

To address this, the Human Proteome Organization
(HUPO) PSI has been gathering participants interested
in designing a next-generation standard spectral library format.
Funding from the National Institutes of Health has recently
been obtained for this development, and initial efforts have
begun, with ongoing work accessible in the PSI
SpectralLibraryFormat GitHub repository
(https://github.com/HUPO-PSI/SpectralLibraryFormat). The success of PSI-developed formats
largely depends on the breadth of participation in the definition 
of requirements and design of the format, and the groups 
gathered at Dagstuhl and Heidelberg offered a great opportunity 
to gather broad input about the requirements for a community- 
approved format. Further interactions on GitHub following the 
meetings allowed additional external inputs. Additional input 
from the community is welcome via the issue tracker at the 
above URL. 

Note that there is sometimes a distinction drawn between a 
spectral library and a spectral archive, such that the spectral 
archive can contain spectra that could not be identified. Here
this distinction is not made and the term "spectral libraries"
refers to collections of mass spectra, identified or unidentified,
that have been assembled to serve as reference data after the
original data processing.

The greatest identified need for a new format is the intro- 
duction of more metadata that can adequately describe the data 
within the spectral library and the library itself. These metadata 
can be broadly organized into four levels. Collection-level 
metadata describe attributes of the library as a whole, such as 
information about the creation or last update, source of the 
library, and global false discovery rate (FDR) of the library. 
Entry-level metadata describe attributes of each spectrum entry
in the library, such as its charge, fragmentation type, origin,
inferred peptide identification (when known) and retention
time. Peak-level metadata describe attributes of each fragmenta-
tion ion peak, including its inferred charge, intensity, and frac-
tion of replicates containing the peak. Peak interpretation-level
metadata describe attributes of each fragmentation ion peak
interpretation (of which there may be multiple per peak), 
including the probable molecule yielding the peak, the isotope 
state, and the delta (m/z) between the observed peak and the 
proposed interpretation. A list of proposed metadata elements at 
these four levels as drafted by this group and follow-on discus- 
sions is provided in the Supporting Information, which will serve 
as a design input for the new format. This list does not represent 
the final specification. 
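
The four levels of metadata can be pictured as nested records. The following sketch is purely illustrative and does not reflect the draft format: all class and field names are hypothetical and were chosen only to mirror the attributes named in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeakInterpretation:
    """Peak interpretation level: one proposed explanation of a peak."""
    molecule: str            # probable molecule yielding the peak
    isotope: int = 0         # isotope state
    delta_mz: float = 0.0    # observed m/z minus proposed m/z


@dataclass
class Peak:
    """Peak level: one fragment ion peak of a library spectrum."""
    mz: float
    intensity: float
    charge: Optional[int] = None
    replicate_fraction: Optional[float] = None
    interpretations: List[PeakInterpretation] = field(default_factory=list)


@dataclass
class LibraryEntry:
    """Entry level: one library spectrum, identified or unidentified."""
    charge: int
    fragmentation_type: str
    origin: str
    peptide: Optional[str] = None          # None for unidentified spectra
    retention_time: Optional[float] = None
    peaks: List[Peak] = field(default_factory=list)


@dataclass
class SpectralLibrary:
    """Collection level: attributes of the library as a whole."""
    source: str
    last_updated: str
    global_fdr: Optional[float] = None
    entries: List[LibraryEntry] = field(default_factory=list)
```

Keeping the peptide optional at the entry level and allowing several interpretations per peak reflects two points made in the text: entries may be unidentified, and a peak may have more than one plausible explanation.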

There are several pieces of metadata considered here for the 
community format that merit further discussion, in part because 
they are not addressed well in previous formats, or they involve 
design choices that are not unanimously embraced. Perhaps 
foremost is the mechanism for specifying residue modifications, 
of which there are four broad classes: mass delta, chemical 
formula, English name, and controlled vocabulary term. The 
mass delta (e.g., “+15.99”) is perhaps the simplest mechanism, 
but suffers from potential precision or rounding problems that 
may lead to ambiguity. A chemical formula is precise and specific 
but will not be easily interpreted to the corresponding molecule 
by many human readers, and different molecules (e.g., glycans) 
may have the same formula but be distinct in structure. 
An English name is typically easily recognized by human readers, 
but can be context specific and the many synonyms and 
abbreviations in use make software recognition awkward (e.g., 
“Ox”, “MetOx”, “oxidation”, “L-methionine sulfoxide”). Finally,
the use of controlled vocabulary (CV) terms is usually specific, 
but accession numbers are not easily recognized by human 
readers, and implementation of controlled vocabularies in soft- 
ware is often cumbersome, especially with multiple CVs to 
choose from (e.g., Unimod, PSI-MOD, PTMList). In the end, a 
design choice will be made to support one or more of these 
options to the dismay of some in the community. 
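
The rounding problem of mass deltas can be made concrete with a small sketch. The lookup table below is a deliberately tiny, illustrative subset of modifications with their monoisotopic mass shifts, and the function name is hypothetical; the tolerance is taken as half a unit of the last decimal place written in the text form of the delta.

```python
# Illustrative subset only; a real tool would consult a full modification list.
KNOWN_DELTAS = {
    "Oxidation": 15.994915,   # addition of O
    "Acetyl": 42.010565,      # addition of C2H2O
    "Trimethyl": 42.046950,   # addition of C3H6
}


def candidates(delta_text):
    """Return the names of all modifications compatible with a written mass delta."""
    value = float(delta_text)
    decimals = len(delta_text.split(".")[1]) if "." in delta_text else 0
    tolerance = 0.5 * 10 ** (-decimals)
    return sorted(name for name, mass in KNOWN_DELTAS.items()
                  if abs(mass - value) <= tolerance)
```

With two decimal places a delta such as "+42.01" points to a single entry of this table, whereas "+42.0" or "+42" is compatible with both acetylation and trimethylation, which is the kind of ambiguity a name or controlled vocabulary term avoids.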

Current spectral libraries were designed with the notion that 
each entry would have an associated peptide identification. 
However, there is good reason to store unidentified spectra as 
well. There are many spectra that are repeatedly observed in 
independent experiments but remain unidentified, often
because the component mass modifications or sequences are not
considered in the search space. Several new searching algo-
rithms, including MSFragger, support open mass tolerance
searches that are able to associate a partial match between a 
spectrum and a peptide, while leaving part of the identification as 
an unknown and unspecified mass delta; the new format should 
also support such matches that are partly identified, but also 
include an unidentified component. A curated list of commonly 
observed spectra that are unidentified but known to be often 
misidentified, leading to erroneous conclusions, would be an 
especially valuable addition to analysis pipelines. Some library 
formats support the addition of unidentified spectra, but often as 
a repurposing of a slot where many software packages already 
expect to find a parsable peptide sequence. Explicit support 
for unidentified spectra should be a key feature for the new 
PSI format. Furthermore, the format should be flexible enough
to accommodate predicted spectra and interconverted
spectra, suitably annotated and differentiable as such, since
there is likely to be rapid progress in the field of spectrum
prediction and interconversion in the coming years.

It is also important to capture retention times in spectral
libraries, as these are often used in downstream analyses. It is 
easy to capture retention times as acquired, but more useful to 
report calibrated retention times, along with the associated 
provenance information and metadata indicating which reten- 
tion time standard was used and how the calibration was performed. 

One reason that the PSI has not yet developed a standard 
spectral library format is that there is dissent about how the 
library should be encoded. Most PSI formats are XML-based or 
tab-separated-value-based, whereas the existing spectral library 
formats are a mix of plain text and binary formats. Plain text 
formats are promoted as being universally readable and easy 
for humans to examine manually and potentially correct when 
software runs into trouble, but they are inefficient in terms of 
disk space and computational resources. Custom binary formats 
are typically the opposite: far more efficient, but hard to restore 
and fix in case of corrupted or inconsistent data or when suitable 
supporting software is not easily within reach or no longer 
available. Broadly supported binary storage systems such as 
HDFS or SQLite provide attractive alternatives to some, but are 
seen as barriers by others in terms of added software complexity 
or lack of sufficient support in a programming language of 
choice. In some ways, this conundrum is still being played out 
with the mzML format,” where every year sees a new publica- 
tion purveying a format that is better than mzML in demon- 
strated ways," 755 while downplaying the trade-offs that others 
will find intolerable. In the end, the best strategy may be to 
develop a standard archival format where universal readability 
and carefully defined metadata are the primary design con- 
siderations, letting those in the community who demand effi- 
ciency transform the primary archival format into a more efficient 
version locally to suit their needs. 

A further important consideration for the development of a 
library format is the mandatory inclusion of quality metrics at 
each level. The quality of a library is a crucial parameter that 
should be considered by all downstream use of that library, 
as false identifications in the library will potentially lead to 
false identifications downstream. Therefore, the new format will 
require a computed posterior error probability or q-value for 
each spectrum entry, as well as the overall estimated FDR for the
library as a whole. This will enable tempering probabilities of 
correctness for downstream identifications with the probabilities 
in the library. In addition, it may also be necessary to extend the 
library by including spectra of decoy matches identified in the 
process of constructing the library, as these may be necessary to 
properly model false discovery rates in the search process. 
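
As a rough illustration of how per-entry quality metrics could roll up to the library level, the sketch below estimates the library FDR as the mean of the per-entry posterior error probabilities (PEPs) and tempers a downstream match probability with the library entry's own PEP. The function names and the independence assumption in the second function are illustrative choices, not part of any proposed format.

```python
def library_fdr(peps):
    """Estimate the overall library FDR as the mean of per-entry PEPs."""
    peps = list(peps)
    if not peps:
        raise ValueError("library has no entries")
    return sum(peps) / len(peps)


def tempered_probability(match_pep, library_pep):
    """Probability that a downstream identification is correct.

    Assumes (for illustration only) that the search match and the library
    entry are wrong independently of each other.
    """
    return (1.0 - match_pep) * (1.0 - library_pep)


entry_peps = [0.001, 0.01, 0.02, 0.009]
print(library_fdr(entry_peps))
print(tempered_probability(0.05, 0.01))
```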


CHALLENGES FOR THE CREATION OF LIBRARIES


Once a common spectral library format has been established, 
there will be challenges associated with the creation of libraries 
to ensure adoption by the community. Indeed, it is important 
that these challenges should be considered as use cases during 
the development of the format. One consideration is that the 
choice of which peaks to retain in library entries is often dictated 
by the anticipated end use of the library. For example, libraries 
designed for use by targeted proteomics or DIA methods may 
contain peaks only within restricted m/z (and/or ion mobility) 
ranges and only a handful of the most intense yet discriminating 
peaks, whereas all intense peaks are typically kept for DDA and 
other applications. Exclusion of reporter ions from isobaric 
labeling techniques may be advantageous for some applications, 
but not others. Among the group it was generally felt that the 
process of filtering libraries for a specific application was 
undesirable; rather, such filtering should occur at runtime by the
analysis software. Yet, the practice may remain common because 
the precise rules of filtering can be easily controlled by the end 
user during library transformation, while altering the analysis 
software may be far more difficult or impossible. Encoding these 
processing choices in the library metadata is important and must 
be supported. 

Another important consideration is the issue of spectrum 
variability by instrument. Fragmentation spectra produced by 
resonant excitation, such as in ion traps, tend to be fairly similar, 
but for beam-type collisional fragmentation spectra, the vari- 
ability as a function of collision energy is far more pronounced. 
While the new spectral library format should easily support 
differentiation by collision energy, the absolute scales of collision 
energy numbers vary among instrument manufacturers, or
even between instruments from the same vendor. For some 
applications, the ramping of collision energy from one value to 
another during acquisition is performed. Even on a single instru- 
ment, natural drift in calibration can lead to some differences in 
the spectra collected at different times. Adequate metadata
fields should be present to capture all cases accurately. 

Although spectral libraries can contain multiple spectra from a 
single peptide ion, most library creation tools will retain only a 
single representative spectrum for each peptide ion in the final 
library. There are broadly two categories of approaches to arrive 
at a single representative spectrum in spectral libraries, the best 
replicate and the consensus spectrum. In the best replicate 
approach, the spectrum that is deemed highest quality is retained 
in the library, although the decision of which is best varies 
among tools; it might be the spectrum that looks most like the 
other replicates, the highest signal-to-noise ratio (SNR) spec- 
trum, the spectrum with the highest ratio of explained to unex- 
plained peaks, or some combination of those. The best replicate 
may be encoded as is or after some noise filtering based on com- 
parison with other replicates. A consensus spectrum approach 
generally compares the top N replicates to each other, discards 
outliers, and then only retains peaks that appear in most of the 
replicates, discarding those that only appear in a few as noise or 
contamination. Input replicates are generally weighted by an 
estimate of SNR to compute the final intensities. Such consensus 
strategies typically filter out nearly all noise. It has been reported 
that consensus spectra perform better than best replicates, but
incremental addition of new replicates to a consensus library is
problematic if the individual spectra are not easily accessible,
whereas in a best-replicate library a new replicate can either be counted as another inferior
replicate or supplant the previous best replicate, provided that 
the metric for best replicate does not require comparison with 
the other inferior replicates. Clearly the new format must accom- 
modate all of these approaches since a single best approach has 
not yet emerged. The metadata must encode the choice(s) 
behind the representative spectrum appropriately. 
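
The consensus strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions: peaks are grouped by fixed-width m/z bins (so two peaks that straddle a bin edge are not merged), outlier replicates are assumed to have been discarded already, and the weights stand in for an SNR estimate. The function name and all parameter values are hypothetical.

```python
from collections import defaultdict


def consensus_spectrum(replicates, weights, bin_width=0.02, min_fraction=0.5):
    """Build a consensus peak list from replicate spectra.

    replicates: list of spectra, each a list of (mz, intensity) tuples.
    weights: one weight per replicate (e.g., an SNR estimate).
    A peak is kept only if its m/z bin occurs in more than min_fraction
    of the replicates; intensities are weighted means over those replicates.
    """
    bins = defaultdict(list)  # bin index -> [(mz, intensity, weight), ...]
    for spectrum, weight in zip(replicates, weights):
        seen = {}
        for mz, intensity in spectrum:
            key = round(mz / bin_width)
            # Keep only the most intense peak per bin within one replicate.
            if key not in seen or intensity > seen[key][1]:
                seen[key] = (mz, intensity, weight)
        for key, peak in seen.items():
            bins[key].append(peak)

    consensus = []
    for key in sorted(bins):
        peaks = bins[key]
        if len(peaks) / len(replicates) <= min_fraction:
            continue  # seen in too few replicates: treated as noise
        total_w = sum(w for _, _, w in peaks)
        mz = sum(m * w for m, _, w in peaks) / total_w
        intensity = sum(i * w for _, i, w in peaks) / total_w
        consensus.append((mz, intensity))
    return consensus


reps = [
    [(100.0, 10.0), (200.0, 5.0)],
    [(100.0, 20.0), (300.0, 1.0)],
    [(100.0, 30.0), (200.0, 7.0)],
]
print(consensus_spectrum(reps, weights=[1.0, 1.0, 2.0]))
```

In this example the peak near m/z 300 appears in only one of three replicates and is dropped, which mirrors the noise-filtering behavior described in the text.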

The ability to merge multiple libraries is an important use case 
to consider and support. Many current methods for building and 
merging libraries rely on starting from scratch with each iteration 
that adds additional data, but as libraries grow substantially in 
size, this will become far less efficient and eventually infeasible. 
Therefore, design decisions that enable one library to be sub- 
sumed into another will be important. It should be possible to 
maintain minimum quality thresholds, compute a new overall 
global FDR at all levels (e.g., spectrum, precursor, peptide and 
protein), and retain complete pedigree information on the 
spectra (e.g., provenance from raw data) that remain present in
the merged library.

An important new initiative of the PSI that will enable tracking
of spectra that comprise a spectral library is the Universal 
Spectrum Identifier (USI) concept. The design of the USI is not 
yet complete and implemented, but aims to provide a unique 
multipart key for every spectrum ever submitted to Proteo- 
meXchange and potentially beyond. This would enable best 
replicate spectra and even consensus spectra to be traceable 
to their origins from within the format. More information on 
the development of the USI concept is provided in a recent 
summary of PSI activities and at http://www.psidev.info/usi.
The USI differs from the SPLASH identifier (http://splash.fiehnlab.ucdavis.edu/)
used by metabolomics reference databases
in that it is designed to identify all original experimental
spectra via a multipart key rather than an algorithmically 
generated hash. 


SINGLE SOURCE AND COMMUNITY LIBRARIES


Most spectral libraries so far are so-called "single-source" 
libraries, where a single group processes large numbers of mass 
spectra made available to them through their own analysis 
pipeline to produce a library for spectral library searching. A list 
of major sites providing such libraries is presented in Table 1. 


Table 1. Major Sites for Download of Peptide Spectral Libraries

Site            URL
NIST            http://peptide.nist.gov/
MassIVE         http://massive.ucsd.edu/ProteoSAFe/static/massive-kb-libraries.jsp
ProteomeTools   http://www.proteometools.org/index.php?id=53
PRIDE Cluster   https://www.ebi.ac.uk/pride/cluster/#/libraries
PeptideAtlas    http://www.peptideatlas.org/speclib/
SWATHAtlas      http://www.swathatlas.org/
SRMAtlas        http://www.srmatlas.org/
GPMDB           https://www.thegpm.org/Hunter/index.html
BiblioSpec      https://proteome.gs.washington.edu/software/bibliospec/v1.0/documentation/libs.html


These libraries have the advantage that quality filtering is usually 
uniformly applied and reasonably well understood, either by 
direct encoding of quality metrics or by reputation. However, 
the comprehensiveness of these libraries is limited by the data 
provided to the creator. Most libraries cover a few biological 
species only and encompass only a subset of commonly used 
analytical platforms and methodologies. 

However, it has been shown that new big data approaches 
could be leveraged to build far more comprehensive
community-sourced libraries. In theory, the application of crowdsourcing
efforts throughout the community could lead to a grand library, 
or set of libraries, that encompass all identifications achieved by 
the community as a whole thus far. This is in contrast to the 
previously described single-source libraries that are generated by 
a single group, even when the source data are collected from 
many laboratories. A core feature of such a community library 
infrastructure would be how to handle conflicting PSMs from 
different groups. Such a community library has the potential to 
transform the field of proteomics, enabling far more sensitive, 
specific, and comprehensive analyses of all datasets. Yet, in 
practice, creating such a comprehensive community library will 
be very challenging to achieve. 

One approach toward a comprehensive library would involve 
setting up a web resource for submitted identified spectra (or 
commonly seen but as yet unidentified spectra). All submissions 
will be processed and integrated into a growing community 
library that can be downloaded and used by everyone. Spectra
produced by the same peptide ion by different instrument
classes and at different collision energies would need to be
stored separately and only aggregated when sufficiently similar.
Spectra for contaminant PTMs, contaminant peptides (e.g., from
a different species than claimed), and different derivatizations
(e.g., isobaric labeling) would all need to be tracked appropriately.
Spectral library search engines would likely only use a subset of the 
spectra from the community library as appropriate for the dataset 
being analyzed. 

However, one of the greatest challenges will be maintaining a 
high degree of quality in the community library. Requiring only 
the highest quality submissions may dissuade participation. 
Labeling contributions as either gold, silver, or bronze based on 
the completeness of metadata, quality of each spectrum (e.g., as 
measured by SNR) and the quality of each PSM (e.g., as
measured by fraction of explainable intensity and number of 
peaks) is one approach to allow greater inclusiveness. No clear 
consensus on the precise definitions of gold, silver, and bronze 
emerged, but in general it was felt that all spectra should have full 
provenance to the dataset, MS run, and original scan number. 
A gold spectrum should have or be a corroborating spectrum 
from a synthetic peptide, have corroborating spectra from a 
different dataset, and have corroborating spectra from the same 
peptide sequence but a different ion (peptidoform or charge). 
Spectra that achieve at least one of these things would be silver, 
and spectra that achieve none would be bronze. Additional 
numerical metrics as described above should also apply, but 
further work is required to set sensible thresholds for the three 
levels. 
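
The tiering rule outlined above can be written down directly. The sketch below encodes only the three corroboration criteria; the additional numerical thresholds are left out because the text notes they have not yet been defined. The function and argument names are hypothetical.

```python
def assign_tier(has_synthetic_match, has_other_dataset_match, has_other_ion_match):
    """Classify a library spectrum as gold, silver, or bronze.

    gold:   corroborated by a synthetic peptide spectrum, by a different
            dataset, and by a different ion of the same peptide sequence
    silver: at least one of the three criteria is met
    bronze: none of the criteria is met
    """
    criteria = (has_synthetic_match, has_other_dataset_match, has_other_ion_match)
    if all(criteria):
        return "gold"
    if any(criteria):
        return "silver"
    return "bronze"


print(assign_tier(True, True, True))
print(assign_tier(False, True, False))
print(assign_tier(False, False, False))
```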

Complete automation would likely be required to ensure 
sustainability. Developing such a community library successfully 
would be a challenging undertaking. A pioneering example in 
the field of metabolomics is GNPS. Single-source libraries
have also been very successful in metabolomics and other small
molecule analysis; some (such as the NIST/EPA/NIH mass
spectral library) started many decades ago and are still actively
maintained. Success of library searching in metabolomics can
be attributed to the fact that until recently, there has been no
alternative to spectral library searching for metabolite
identification due to inherent differences between protein and small
molecule identification approaches. The number of charac- 
terized analytes and the number of biomolecules in reference 
libraries relevant for metabolomics is, however, usually orders of 
magnitude smaller than in proteomics; the large number of 
spectra in, for example, the NIST/EPA/NIH library is mostly 
due to derivatives. In contrast to proteomics, reference sub- 
stances are required to establish small molecule mass spectral 
libraries with confident identifications, thus generating reference 
spectra to put into a spectral library is far more difficult and time- 
consuming than in proteomics, as reference substances can be 
very expensive or impossible to obtain. Due to the inherent 
differences in applications, past success in metabolomics does 
not ensure success for proteomics. 

There was doubt among some participants at the Dagstuhl 
discussion that such a community library would become widely 
used. Anecdotes were related of large numbers of researchers 
preferring to develop their own libraries based on their own 
samples and instruments, even when a more comprehensive
library fully suitable to their system of study was
available. Using a very large library introduces substantial challenges
for proper FDR control, and testing many hypotheses that are 
not relevant for the current sample reduces sensitivity. It remains 
an unresolved issue under active research in the field whether
sample-specific libraries should be preferred to comprehensive 
libraries, especially as it pertains to DIA analysis. 


APPLICATION OF SPECTRAL LIBRARIES


Spectral library searching on its own is a powerful technique, 
demonstrated to be more sensitive and more specific than 
sequence searching, but only for peptide ions present in the
reference library. In order to identify those ions that are not in 
the reference library, it seems logical to couple spectral library 
searching with sequence database searching, where the former 
assigns those peptide ions that have been previously identified, 
and the latter identifies peptide species that are not in the library,
merging the results of the two approaches into a single output for
the user. This has been possible for many years in the
Trans-Proteomic Pipeline (TPP) with iProphet, but still is not
commonly performed. Such a workflow has been recently 
implemented in Mascot Server, which may well increase the 
adoption of the approach. 

The target-decoy approach for estimating the number of 
false positives at any selected threshold is commonly applied for 
sequence database searching, either by including the decoy 
sequences in the searched sequence database or by generating 
the decoys on-the-fly. There are several approaches to gener- 
ating decoys, including reversing each protein sequence, generating 
random sequences based on the relative frequencies of amino acids 
in the database, reversing tryptic peptide sequences (i.e., holding
the positions of lysine and arginine residues fixed and reversing
between them), and scrambling the order of amino acids between
lysines and arginines. Investigations into the best approach show
comparable effectiveness. An easy metric to assess the usefulness
of decoys is to compute the balance between targets and decoys
for zero-probability identifications; the idea is that the relative ratio
of targets to decoys among the known incorrect results should be 
equal to the ratio of targets to decoys in the reference database. 
The target-decoy approach can also be applied to spectral 
libraries and spectral library searching, and several ways have been 
used to produce decoys, for example by adding a fixed value to the 
precursor and/or fragment m/z, randomly assigning new m/z 
values to the peaks, and scrambling the letters of the peptide 
(except for a terminal cleavage residue) while moving the 
identifiable peaks around to match the scrambled sequence. 
A comparison of these approaches was performed by Lam et al.
The results indicate that none of the proposed approaches truly 
achieve equal probability for target and decoy matches in cases of 
a zero-probability match. This is likely because these approaches 
do not produce decoy precursor/spectra that are similar enough 
to real spectra. The approach of scrambling the peptide sequence 
and moving the known peaks outperforms the other approaches, 
but it is still somewhat biased, and consistently more so than
the target-decoy approaches used in sequence database
searching. This may in part be due to the fact that un-annotated
peaks are not moved and thus contribute to an incomplete 
prediction of what the scrambled peptide would be. Similarly, this 
technique becomes unavailable when libraries contain unassigned 
spectra since there is no sequence to scramble. Although this bias 
can be estimated and accounted for in the model to determine 
FDR, more work in this domain is needed. 
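
Two of the sequence-level decoy strategies mentioned above, together with the target/decoy balance check, can be sketched as follows. This is an illustration only: the function names are hypothetical, the shuffle uses a fixed seed purely for reproducibility, and the spectrum-level step of moving annotated peaks to match a scrambled sequence is not shown.

```python
import random
import re


def reverse_between_kr(sequence):
    """Reverse residues between K/R while keeping every K and R in place."""
    parts = re.split(r"([KR])", sequence)
    return "".join(p if p in ("K", "R") else p[::-1] for p in parts)


def shuffle_between_kr(sequence, seed=0):
    """Scramble residues between K/R; K and R positions stay fixed."""
    rng = random.Random(seed)
    out = []
    for part in re.split(r"([KR])", sequence):
        if part in ("K", "R"):
            out.append(part)
        else:
            residues = list(part)
            rng.shuffle(residues)
            out.append("".join(residues))
    return "".join(out)


def target_decoy_ratio(labels):
    """Ratio of targets to decoys among matches believed to be incorrect.

    labels: iterable of "target" / "decoy" strings for zero-probability hits.
    For unbiased decoys this should be close to the target/decoy ratio of
    the searched database or library.
    """
    labels = list(labels)
    decoys = labels.count("decoy")
    if decoys == 0:
        raise ValueError("no decoy matches to compare against")
    return labels.count("target") / decoys


print(reverse_between_kr("PEPTIDEKAGR"))
print(shuffle_between_kr("PEPTIDEKAGR"))
print(target_decoy_ratio(["target", "decoy", "target", "decoy"]))
```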

Another topic of discussion was the mechanism by which two 
spectra are compared, typically a library spectrum and a new 
experimental spectrum. There are two major aspects to this 
issue, the algorithm used to compare the intensities and m/z 
values, and how to handle peaks without a match. Several 
broadly similar algorithms are available for comparison of
spectra, most commonly a dot product, a dot product of the Nth
root of intensities to reduce the influence of a few intense peaks,
a cosine score, and the normalized spectral contrast angle
approach. Other approaches have also used probabilistic
models of variation in peak intensities or proposed machine
learning models combining multiple features into a single score.
However, perhaps a greater influence is exactly which peaks are 
aligned and go into the score. Exclusion of un-fragmented 
precursor peaks, reporter ion peaks, and other non-informative 
peaks seems logical, but approaches where the absence of a 
library peak in the acquired data is not penalized could lead to 
false positives with seemingly high scores if only a few peaks are 
shared. Some approaches include a training step to calculate 
characteristic parameter values for each peak. Indeed, when
calculating similarity scores between new experimental spectra 
and reference spectra from synthetic peptides, it is important 
that all informative peaks are included, even when a peak has no 
counterpart. In making the decision of which peaks to use, it 
is important to consider the intent of the comparison: is
spectrum A equivalent to spectrum B? Is spectrum A the primary 
constituent in spectrum B with minor additional contamination? 
Is spectrum A one of many constituents in spectrum B? 
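
To make the two aspects concrete, the sketch below pairs peaks within an m/z tolerance, keeps unmatched peaks on either side as zero-intensity partners so that they are penalized, and then computes a cosine score on root-transformed intensities and the corresponding normalized spectral contrast angle. The greedy matching, the tolerance, and the square-root default are illustrative assumptions rather than the behavior of any particular search engine.

```python
import math


def match_peaks(library, query, tol=0.02):
    """Pair library and query intensities; unmatched peaks pair with 0.0."""
    pairs = []
    used = set()
    for mz_l, int_l in library:
        best = None
        for j, (mz_q, _) in enumerate(query):
            if j in used or abs(mz_q - mz_l) > tol:
                continue
            if best is None or abs(mz_q - mz_l) < abs(query[best][0] - mz_l):
                best = j
        if best is None:
            pairs.append((int_l, 0.0))  # library peak absent from query
        else:
            used.add(best)
            pairs.append((int_l, query[best][1]))
    # Query peaks with no library counterpart are also kept.
    pairs.extend((0.0, q[1]) for j, q in enumerate(query) if j not in used)
    return pairs


def cosine_score(library, query, tol=0.02, power=0.5):
    """Cosine of intensity vectors after raising intensities to `power`."""
    pairs = [(a ** power, b ** power) for a, b in match_peaks(library, query, tol)]
    dot = sum(a * b for a, b in pairs)
    norm_l = math.sqrt(sum(a * a for a, _ in pairs))
    norm_q = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_l * norm_q) if norm_l and norm_q else 0.0


def contrast_angle_score(cosine):
    """Normalized spectral contrast angle: 1 - 2 * acos(cosine) / pi."""
    return 1.0 - 2.0 * math.acos(max(-1.0, min(1.0, cosine))) / math.pi


lib = [(100.0, 10.0), (200.0, 40.0)]
qry = [(100.01, 12.0), (200.0, 35.0), (350.0, 3.0)]
score = cosine_score(lib, qry)
print(score, contrast_angle_score(score))
```

Dropping the `pairs.extend(...)` line, or skipping library peaks without a query match, would correspond to the unpenalized variants that the text warns can yield high scores from only a few shared peaks.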

Comparison of spectra generated from peptides in natural 
samples with spectra generated from synthetic peptides is a 
powerful technique for verifying that the spectrum identifica- 
tions are correct, and is specifically called out in the HPP MS 
Data Interpretation Guidelines. SRMAtlas, a large-scale effort
to develop reference spectra for a few peptides for each human
protein, has been completed, and the ProteomeTools project,
which aims to generate synthetic peptide spectra for nearly
all accessible human tryptic peptides, is ongoing. Efforts are
underway to validate discovery of HPP missing proteins via
the comparison with SRMAtlas spectra. This process could be 
automated such that comparison of newly proposed HPP missing 
protein detections could easily be checked against available 
synthetic peptide spectra. 


Library Searching for DIA Applications 


While initial interest in spectral libraries was driven by spectral 
library searching of DDA MS datasets, the recent expansion in 
interest has been driven by applications to DIA workflows. 
In these workflows, the precursor ion selection window is much 
wider; thus, the instrument co-fragments many different peptide 
ion species at once, thereby creating highly multiplexed frag- 
mentation spectra. Although library-free approaches to analyzing 
such data continue to emerge, the most common methods
for analyzing DIA data involve extracting chromatograms for 
each spectral library fragment ion for a given peptide and 
determining based on their presence and co-elution whether a 
given peptide is in the sample. These approaches are asking a 
fundamentally different question to database searching; i.e., 
rather than trying to identify a spectrum, they are asking whether 
there is evidence for a peptide of interest being present in the 
sample. However, while the original libraries created from the 
DDA MS input datasets include all peaks from the peptide ion 
and have been shown to enable peptide identification from DIA 
data, the derived libraries destined for use by DIA analysis are
typically trimmed such that only the top N (where N is often 5, 6, 
or 10) most informative peaks are retained. It should be easy 
to distinguish between the primary archival libraries and the 
derived, trimmed versions intended for special applications. 
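
A trimmed, derived library entry of the kind described above could be produced along these lines. The sketch uses raw intensity as a stand-in for "most informative", which is an assumption; real tools may also weigh fragment type or uniqueness. The function name, the exclusion-range argument, and the example values are hypothetical.

```python
def trim_to_top_n(peaks, n=6, exclude=()):
    """Keep the n most intense peaks, returned in m/z order.

    peaks: list of (mz, intensity) tuples from the archival entry.
    exclude: (low, high) m/z ranges to drop first, e.g. a reporter-ion region.
    """
    kept = [p for p in peaks if not any(lo <= p[0] <= hi for lo, hi in exclude)]
    top = sorted(kept, key=lambda p: p[1], reverse=True)[:n]
    return sorted(top)


entry = [(100.0, 5.0), (200.0, 50.0), (300.0, 20.0), (400.0, 1.0)]
print(trim_to_top_n(entry, n=2))
print(trim_to_top_n(entry, n=2, exclude=[(190.0, 210.0)]))
```

Recording the values of `n` and `exclude` in the derived library's metadata is what would let a reader tell it apart from the primary archival version.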


Some in the group highlighted that while six peaks may be 
sufficient to distinguish most peptides from one another, six
peaks may not be enough to confidently distinguish among 
different post-translational modification (PTM) isomers based 
on the same peptide sequence. This highlights the need for 
better encoding of metadata in spectral libraries, since the cur- 
rent formats do not support a uniform mechanism for encoding 
whether a library has been trimmed to suit a specific application 
and how that has been done. Also, in the case of PTM isomers, 
certain peaks in the fragment ion spectrum are highly informa- 
tive while others are shared between isomers. Current approaches 
in DIA analysis of peptidoforms include annotation of fragment 
ions based on their capability to act as a "unique ion signature" for
a specific peptidoform. The proposed format will need to
capture this information on the fragment ion level as well. 

The customary workflow for library-based DIA analysis 
involves the development of the reference library source from 
DDA input data, where most fragmentation spectra are relatively 
pure and FDR control is well understood. However, the 
emergence of library-free DIA analysis techniques with tools 
such as DIA-Umpire, which use co-elution profiles of precursor 
and fragment ions to create filtered, simplified MS/MS spectra 
for searching, enables the possibility of developing spectral
libraries from DIA data directly. This has the potential advantage 
that the reference spectra are created on the same instrument 
under the same collision energy and selection window con- 
ditions as the eventual subsequent analysis. However, most of 
the Dagstuhl group had serious reservations about such 
approaches, primarily due to the substantial and insufficiently 
understood uncertainties in controlling false positives in highly 
multiplexed spectra when assaying with a limited number of 
peaks. With DIA data, there is a magnified danger of confusing a 
peptide ion with another peptide ion that has a similar sequence 
but with a different mass modification due to the large precursor 
selection windows employed in DIA data. Also, most approaches 
to spectral library generation attempt to create high quality 
libraries from pure compounds to reduce error rates in the 
library itself and further research is needed on how impurities 
and low-quality entries in the spectral library affect DIA analysis. 

Other complicating factors for DIA analysis include 
accounting for the use of trimmed spectral peak lists in the 
initial identification, as the reliability measure attached to the 
library spectrum should be changed. The use of relative fragment 
ion intensities, as well as retention and drift (collisional cross 
section) time are other challenges for reliability estimation. 
The normalized or indexed retention time of a peptide could be 
valuable information for improving the confidence of an iden- 
tification. However, determining retention times for decoy spectra 
is challenging. Current tools address this issue by ensuring that the 
overall distribution of retention times is equal between targets and 
decoys and that peptides of equal AA composition are assigned the 
same retention time. Retention times for decoy spectra could be 
estimated with retention time prediction tools, but these are 
usually less accurate than empirically determined values, so it is 
not clear how reliability estimates can be calibrated when including 
retention time as a factor. 


Highly Similar Spectra 


More broadly, the entire topic of highly similar spectra in a 
library generated good discussion among the group and is an 
area requiring additional research. Tools such as SpectraST
include quality control routines that can, at the discretion of the 
user, prune library entries that have highly similar spectra (and 
precursor m/z) to another entry that is not simply a sibling 
peptidoform (e.g., a singly phosphorylated peptide with the
phosphorylation at a different site). This can be applied under 
the assumption that either one of the two similar entries is 
misidentified, or, if they are both correct, the spectra are so 
similar that MS cannot effectively distinguish between the two 
with current technology. A better approach may be to develop 
more advanced tools that can assess the ambiguity and provide
the user with a probabilistic set of options (e.g., 99.9% confidence
that a new spectrum is either peptide ion A or B, but
distinguishing among those two is only 60%/40%). Clearly, further
work is required, and the capture of the metadata on which 
choices were made for construction of the library will be important. 


Gold Standard Test Dataset 


A recurring theme of discussion was the need for a gold standard 
dataset and library that could be used in the uniform testing of 
various approaches and tools. None of the current set of existing 
reference datasets summarized at http://compms.org/resources/reference-data
was deemed suitable for this purpose. The
group decided that a good standard dataset would consist of one 
spectral library with ~10 000 entries and one mzML file with 
~10 000 spectra, in which ~5000 peptide ions (but not exact 
spectra) were in common. Each of the 10 000 spectra in the two 
files should be derived from synthetic peptides (e.g., from
ProteomeTools or SRMAtlas), and, thus, the corresponding
identities are known precisely. There should be a combination of 
high SNR spectra and low SNR spectra, where the low SNR 
spectra are derived from fragmentation near the fringes of an 
elution profile for which conclusive PSM evidence is available 
from a spectrum obtained near the peak of the same profile. 
A vetting process conducted by several groups to identify 
and discard errors in the spectrum identification list will be 
important to ensure a true gold standard. Efforts are underway to 
produce such a gold standard dataset. 


CONCLUSION


Spectral libraries remain a substantially under-utilized resource 
in proteomics, with the potential to vastly increase the efficiency 
of research. Other fields, such as metabolomics, have demon- 
strated the utility of spectral libraries; however, concepts from 
metabolomics will not always directly translate to proteomics. 
Future workflows will likely perform more than one stage of 
spectral library searching. The first stage will determine the most 
appropriate libraries to search and suitable parameters, a second
stage will search against an extensive collection of the most
suitable community libraries including identified and unidentified
representative spectra derived from public datasets, and a
final stage will perform sequence database searching of only
the high-quality spectra that remain unmatched after spectral 
library searching. This complex workflow should be designed to 
happen with minimal input from the user, and the results from all 
stages should be presented in a unified manner. Newly identified 
peptide species should be automatically added to local spectral 
libraries and optionally contributed to the community libraries, 
similarly to what is already enabled for metabolomics spectral 
libraries at GNPS. Once such workflows become easier, faster,
and more effective than current techniques, spectral libraries will 
be more widely adopted. 

However, before that can happen, there are still a substantial 
number of challenges that must be overcome. Spectral library 
building, handling, and searching software must become more 
advanced. The cooperative development and inter-operability 
of spectral library-using software requires a widely adopted 
community standard format, especially one that could encode
extensive metadata about the library and its contents. The PSI is 
embarking on an effort to create this standard, and wide partici- 
pation from the community will be a key contributing factor. All 
contributions are welcome via
https://github.com/HUPO-PSI/SpectralLibraryFormat.

Beyond the development of the standard format, there remain
many open questions in need of addressing by research in the 
community as described above, including how to set up com- 
munity libraries, generate decoys, develop a gold reference 
standard, and how to compare spectra. By building a standard 
spectral library format, creating more advanced analysis software 
that capitalizes on the format, and addressing the remaining 
open research questions, a broad array of biomedical applica- 
tions using all MS-based proteomics technologies will be enabled 
and accelerated. 


5.3 Proceedings of the EuBIC Winter School 2017


In 2017, the first EuBIC Winter School on proteomics bioinformatics took
place in Semmering, Austria. This is an important event for the bioinformatics
community working in the field of proteomics. At the winter school, the
scientific community presented and discussed the latest achievements and current
challenges in this field, advancing proteomics research. The
developers' meeting was designed so that bioinformaticians working in this field
could tackle the previously identified challenges. The author of this thesis was
the main organizer of this conference. Together with her co-authors, she
summarized the event and its outcomes, resulting in a
publication in the Journal of Proteomics, 2017. This chapter contains this
publication, titled "Proceedings of the EuBIC Winter School 2017" [91]; see
http://dx.doi.org/10.1016/j.jprot.2017.04.001.

The 2017 EuBIC Winter School was held from January 10th to January 13th, 2017, in Semmering, Austria. This
meeting gathered international researchers in the fields of bioinformatics and proteomics to discuss current chal- 
lenges in data analysis and biological interpretation. This article outlines the scientific program and exchanges 
that took place on this occasion and presents the current challenges of this ever-growing field. 

Biological significance: The EUPA bioinformatics community (EuBIC) organized its first winter school in January 
2017. This successful event illustrates the growing need of the bioinformatics community in proteomics to gather 
and discuss current and future challenges in the field. In addition to the organization of yearly meetings, the 
young and active EuBIC community aims to develop new collaborative open source projects, spread bioinformat- 
ics knowledge in Europe, and actively promote data sharing through public repositories. 
1. Introduction 


The European Bioinformatics Community (EuBIC) initiative, sup- 
ported by the European Proteomics Association (EUPA), is an open com- 
munity of users and bioinformaticians with a special interest in 
proteomics. Importantly, this community initiative is not limited to 
academia, but aims at including actors from industry. Our ultimate 
goal is to make the field of bioinformatics more accessible in both the 
proteomics and mass spectrometry community. To this end, we are de- 
veloping an open digital infrastructure including an educational 
website, wiki and Q&A (http://www.proteomics-academy.org). Fur- 
thermore, we organized multiple bioinformatics hubs and educational 
workshops at several international conferences. Here, we present our 
latest achievement: the organization of the EuBIC Winter School 2017, 
as follow-up to the 2015 MidWinter Proteomics Bioinformatics 
Seminar. 

This Winter School on Proteomics Bioinformatics was held in 
Semmering, Austria, from January 10th to January 13th. It attracted 
114 participants from 17 different countries and representatives from 
three sponsor companies: Thermo Fisher Scientific, Waters, and Qiagen 
(see Fig. 1). Indicative of the high interest in the topic, the conference 
venue was fully booked. All participants, from BSc students to group 
leaders, from academia as well as industry, were actively involved with 
networking and forging collaboration opportunities, even during eve- 
ning social events. Ten different workshops were organized on multiple 
topics covering all levels of expertise, as well as nine keynotes from in- 
ternationally renowned speakers, 15 participant flash talks, and two 
poster sessions comprising a total of 39 posters. All abstracts and work- 
shop descriptions can be found at the Winter School homepage in the 
final program (https://www.fh-ooe.at/eubic-ws17). 


2. Workshops 


During the first day, three 8-hour workshops were run in parallel 
on general proteomics analyses: (1) an introduction to proteomic data 
analysis with open software intended for novice users (H. Barsnes & 
M. Vaudel) [1], (2) a training on the commercial tools Progenesis QI 
for proteomics (www.nonlinear.com/progenesis/qi-for-proteomics, 
Nonlinear Dynamics & Waters) and Proteome Discoverer 
(www.thermofisher.com/order/catalog/product/IQLAAEGABSFAKJMAUH, 
Thermo Fisher Scientific), and (3) a joint de.NBI (German Network for 
Bioinformatics Infrastructure) and ELIXIR (a distributed infrastructure 
for life sciences) hackathon (V. Schwämmle, M. Palmblad, J. Ison, N. 
Beard, G. Mayer & J. Uszkoreit). This was the first outreach of ELIXIR to 
the proteomics community with the presentation of its diverse 
activities, such as the Tools and Data Services Registry (bio.tools) 
and the training platform TeSS. Moreover, the hackathon provided 
a stage for the presentation and discussion of standard formats, 
ontologies, and workflow compositions, all of which are indispensable 
in bioinformatics [2-5]. 

Fig. 1. Participants. All participants of the EuBIC Winter School 2017. In the background, the outlook from the conference venue over Semmering remains visible. 

On the second day, participants could choose from four parallel af- 
ternoon workshops on integrative analyses and multi-omics: (1) struc- 
tural interactomics and cross-linking to investigate protein-protein 
interactions (F. Liu) [6], (2) RNA-seq analysis from non-model 
organisms to prepare draft proteome sequence databases (D. Tabb) 
[7], (3) pathway analysis with the commercial tool Ingenuity® Pathway 
Analysis (IPA®, www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis, 
Qiagen) to uncover the biological significance of 
'omics data, and (4) sharing and interpreting big data from proteomics 
as a community by spectral libraries (N. Bandeira) [8]. 

Finally, on the third day, participants could choose from three work- 
shops on academic bioinformatic tools and resources: (1) an introduc- 
tion to the MaxQuant/Perseus tool suite (J. Cox) [9], (2) an 
introduction to bioinformatics workflow design using OpenMS (O. 
Kohlbacher & J. Pfeuffer) [10], and (3) an introduction to the usage 
and integration of the Reactome database (A. Fabregat Mundo) [11]. 


3. Keynotes 


A total of nine 50-minute keynotes and discussions were organized 
on specific topics on the morning of each day except the first. Presenting 
authors were asked to present a bioinformatic problem and its context, 
and lead an open debate with the community. 

Wednesday morning started with a presentation of N. Bandeira 
(University of California, San Diego, USA) about constructing communi- 
ty knowledge for peptide identification and quantification. He argued 
that a substantial share of public data is still missing protein identifica- 
tion and quantification, and that efforts from the community can greatly 
complement individual research [8]. The following presentation from D. 
Tabb (Stellenbosch University, Stellenbosch, ZA) highlighted the fact 
that data should in addition undergo more rigorous quality control. To 
this end, a working group from HUPO-PSI is developing a more trans- 
parent quality control infrastructure [7]. J. Ison (Technical University 
of Denmark, Lyngby, DK) and N. Beard (University of Manchester, 
Manchester, UK) gave an overview of their earlier workshop on Tuesday, 
recapitulating the strength and flexibility of ELIXIR's distributed 
computational infrastructure. They notably described ELIXIR's registry 
as "PubMed for software tools and workflows" as opposed to "scientific 
publications" [3]. Finally, F. Liu (Utrecht University, Utrecht, NL) intro- 
duced structural interactomics and its applications in cell biology, show- 
ing the power of combining mass spectrometry with chemical cross- 
linking [6]. 

Thursday, L. Martens (VIB and Ghent University, Ghent, BE) sug- 
gested that sharing data supporting publications has finally become 
standard practice in proteomics. Sadly, much of these data remain un- 
touched, even though they may contain hidden treasures no one was 
initially looking for, such as alternative products of translation [12]. 
Thereafter, O. Kohlbacher (University of Tübingen, Tübingen, DE) 
showed how OpenMS and KNIME simplify the automated processing 
of quantitative proteomics data [10]. A. Fabregat Mundo (European Bio- 
informatics Institute (EBI), Cambridge, UK) concluded the day with an 
overview of Reactome: a curated knowledgebase of biomolecular path- 
ways [11]. 

On the last day, J. Cox (Max Planck Institute of Biochemistry, 
Martinsried, DE) demonstrated the use of MaxQuant and Perseus, two 
software packages to analyze large-scale (prote)omics data [9]. The 
final presentation of the Winter School was given by M. Palmblad 
(Leiden University Medical Center, Leiden, NL). He remarked that uni- 
fied ontologies are necessary to interpret and integrate both publicly 
available data and software [4]. 


4. Posters & flash talks 


Participants had the opportunity to present their scientific work 
in two poster sessions. All attendees were very impressed by the 
quality of the posters presented, showing the increasing quality of 
the work of the bioinformatics community. Both poster sessions 
were perfect occasions for the attendees to interact, yielding long 
and animated debates (until late in the night) as well as new 
collaborations and projects. Fifteen posters were selected for flash talks prior to 
the conference through peer review. Topics included, but were 
not limited to: analysis of metaproteomes, spectrum clustering, 
phosphoproteomics, cancer studies, analysis of post-translational 
modifications, and novel tools and implementations for statistical 
analysis. The poster award was awarded to T. Muth (Robert Koch 
Institute, Berlin, DE) for his poster "Analyzing metaproteome 
samples on the go: the full-featured MPA portable software provides 
protein identification enriched with taxonomic and functional meta- 
information." [13] while the flash talk award went to L. Goeminne 
(VIB and Ghent University, Ghent, BE) for his talk "MSqRob: analysis 
of label-free proteomics data in an R/Shiny environment" [14] (see 
Fig. 2). 
Fig. 2. Best poster and flash talk awards. The poster prize was awarded to T. Muth while the flash talk award went to L. Goeminne after deliberation from juries led by Lennart Martens and 
David Tabb. Karl Mechtler was Master of Ceremony of the flash talk session. From left to right: Veit Schwämmle, Marc Vaudel, Viktoria Dorfer, David Bouyssié, Marie Locard-Paulet, Thilo 
Muth, Ludger Goeminne, Karl Mechtler and Lennart Martens. 


5. Outlook 


Even though EuBIC is a very young initiative, the Winter School 
reached a large audience. Representatives of major organizations such 
as EuPA, ELIXIR, and the EBI described the organization of the Winter 
School as a major success (despite the small practical issues inherent to 
the unexpectedly large number of attendees). Strengthened by this 
appreciation, EuBIC will pursue its efforts to ensure the Winter School is 
organized annually, switching each year between a Developers Meeting 
aimed at bioinformaticians, and a General Meeting open to everyone. 
We strongly believe this Winter School was a major step towards our ul- 
timate goal of making the field of bioinformatics more accessible in the 
scientific community. We encourage all scientists willing to contribute 
to this endeavor to join the initiative and contact us via our open 
forum, https://groups.google.com/forum/#!forum/eupa-bioinfo, or di- 
rectly on http://www.proteomics-academy.org. 


Transparency document 


The Transparency document associated with this article can be 
found in the online version. 


Acknowledgements 


Funding for the 2017 EuBIC Winter School was obtained from the 
following sponsors: ELIXIR Denmark, Qiagen, Waters, Thermo Fisher 
Scientific, the University of Applied Sciences Upper Austria, the Univer- 
sity of Bergen, the University of Southern Denmark, the Medizinisches 
Proteom-Center, the Institute of Pharmacology and Structural Biology, 
and the European Proteomics Association (EuPA). 

This Winter School could not have been realized without the help of 
all EuBIC members, and especially the following people: Dominik 
Kopczynski, Hayley Price, and Alessio Soggiu. The organizing committee 
especially thanks the speakers and workshops organizers for bringing 
the Winter School to such a level of excellence, companies and exhibitors 
for their active participation in this community initiative, and all 
participants for the active and fruitful interactions. We address a special 
thanks to Andrea Urbani for his indefectible support. 


References 


[1] M. Vaudel, J.M. Burkhart, R.P. Zahedi, E. Oveland, F.S. Berven, A. Sickmann, et al., 
PeptideShaker enables reanalysis of MS-derived proteomics data sets, Nat. 
Biotechnol. 33 (2015) 22-24. 

[2] E.W. Deutsch, J.P. Albar, P.-A. Binz, M. Eisenacher, A.R. Jones, G. Mayer, et al., 
Development of data representation standards by the human proteome organization 
proteomics standards initiative, J. Am. Med. Inform. Assoc. 22 (2015) 495-506. 

[3] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, et al., Tools and data 
services registry: a community effort to document bioinformatics resources, Nucleic 
Acids Res. 44 (2016) D38-D47. 

[4] A.T. Guler, C.J.F. Waaijer, Y. Mohammed, M. Palmblad, Automating bibliometric 
analyses using Taverna scientific workflows: a tutorial on integrating web services, J. 
Informet. 10 (2016) 830-841. 

[5] Y. Perez-Riverol, J. Uszkoreit, A. Sanchez, T. Ternent, N. del Toro, H. Hermjakob, et al., 
ms-data-core-api: an open-source, metadata-oriented library for computational 
proteomics, Bioinformatics 31 (2015) 2903-2905. 

[6] F. Liu, D.T.S. Rijkers, H. Post, A.J.R. Heck, Proteome-wide profiling of protein assem- 
blies by cross-linking mass spectrometry, Nat. Methods 12 (2015) 1179-1184. 

[7] J. Griss, Y. Perez-Riverol, S. Lewis, D.L. Tabb, J.A. Dianes, N. del Toro, et al., 
Recognizing millions of consistently unidentified spectra across hundreds of shotgun 
proteomics datasets, Nat. Methods 13 (2016) 651-656. 

[8] M. Wang, J.J. Carver, V.V. Phelan, L.M. Sanchez, N. Garg, Y. Peng, et al., Sharing and 
community curation of mass spectrometry data with Global Natural Products Social 
Molecular Networking, Nat. Biotechnol. 34 (2016) 828-837. 

[9] S. Tyanova, T. Temu, J. Cox, The MaxQuant computational platform for mass 
spectrometry-based shotgun proteomics, Nat. Protoc. 11 (2016) 2301-2319. 

[10] H.L. Röst, T. Sachsenberg, S. Aiche, C. Bielow, H. Weisser, F. Aicheler, et al., OpenMS: a 
flexible open-source software platform for mass spectrometry data analysis, Nat. 
Methods 13 (2016) 741-748. 

[11] A. Fabregat, K. Sidiropoulos, P. Garapati, M. Gillespie, K. Hausmann, R. Haw, et al., 
The Reactome pathway knowledgebase, Nucleic Acids Res. 44 (2016) D481-D487. 

[12] L. Martens, J.A. Vizcaíno, A golden age for working with public proteomics data, 
Trends Biochem. Sci. (2017 Jan 21) pii: S0968-0004(17)30001-4. 

[13] T. Muth, A. Behne, R. Heyer, F. Kohrs, D. Benndorf, M. Hoffmann, et al., The 
MetaProteomeAnalyzer: a powerful open-source software suite for metaproteomics 
data analysis and interpretation, J. Proteome Res. 14 (2015) 1557-1565. 

[14] L.J.E. Goeminne, K. Gevaert, L. Clement, Peptide-level robust ridge regression 
improves estimation, sensitivity, and specificity in data-dependent quantitative 
label-free shotgun proteomics, Mol. Cell. Proteomics 15 (2016) 657-668. 


5.4 Proceedings of the EuBIC Developer's Meeting 2018 


The annual EuBIC conference alternates between a Winter School and 
a Developer's Meeting; the first EuBIC Developer's Meeting took place in 
Ghent, Belgium, in 2018. The author of this thesis was one of the co-organizers 
of this event and participated in making the findings of this meeting available 
to the proteomics community through a publication. This section contains 
the corresponding publication, titled "Proceedings of the EuBIC developer's 
meeting 2018", published in the Journal of Proteomics, 2018 [92], see 
https://doi.org/10.1016/j.jprot.2018.05.015. 
ABSTRACT 


The inaugural European Bioinformatics Community (EuBIC) developer's meeting was held from January 9th to 
January 12th 2018 in Ghent, Belgium. While the meeting kicked off with an interactive keynote session featuring 
four internationally renowned experts in the field of computational proteomics, its primary focus was the 
hands-on hackathon sessions, which featured six community-proposed projects revolving around three major 
topics: 

1. quality control 
2. workflows, protocols, and guidelines 
3. quantification. 


Here, we present an overview of the scientific program of the EuBIC developer's meeting and provide a 
starting point for follow-up on the covered projects. 


1. Introduction 


The European Bioinformatics Community (EuBIC) is an initiative of 
the European Proteomics Association (EuPA) to promote the use of 
bioinformatics for computational mass spectrometry (MS) and MS- 
based proteomics. Our goal is to bring together the European MS 
bioinformatics community, including students and early-career re- 
searchers as well as long-standing experts from both academia and in- 
dustry. Through the setup of community-driven dynamics, EuBIC 
mainly focuses on improving education in computational methods, job 
and funding opportunities, international collaborations, publication of 
specialized studies, and training of software tools. To this end, EuBIC 
maintains several web resources that include educational videos, grant 
overviews, a job fair, and tutorials (https://www.proteomics-academy.org/). 
Besides these online resources, EuBIC regularly organizes 
workshops and hubs at the major international conferences on 
computational MS and proteomics. Additionally, an annual conference on 
computational MS-based proteomics is organized by EuBIC itself, 
forming an important community outreach effort to bring together 
bioinformatics researchers from all over Europe. 

The first EuBIC conference took place in January 2017 in 
Semmering, Austria [13]. As this turned out to be an overwhelming 
success, we envisioned to organize the EuBIC conference as an annual 
series. However, although this event brought together the European 
proteomics community, we observed that not all computational ex- 
pertise was utilized to its full potential in the typical conference setup 
consisting of presentations and workshops. Therefore we decided to 
alternate the annual EuBIC conference between a Winter School tar- 
geting a broad end user-oriented audience and a developer's meeting for 
software developers. 
Fig. 1. Participants of the EuBIC developer's meeting 2018. 


The inaugural EuBIC developer's meeting was organized in Ghent, 
Belgium, from January 9th to January 12th 2018 
(http://uahost.uantwerpen.be/eubic18/). A total of 43 participants (Fig. 1), including 
students, keynote speakers, and industry representatives from 14 dif- 
ferent countries participated in the developer's meeting. To stimulate 
direct collaboration and the active development of bioinformatics ap- 
plications, its main activity was a hackathon focusing on six important 
topics in computational proteomics which were crowd-sourced from the 
community. Additionally, prior to these hackathon sessions the meeting 
participants engaged in an interactive keynote session led by four in- 
ternationally renowned scientists with experience in tool development 
for MS-based proteomics. 


2. Keynote presentations 


The EuBIC developer's meeting kicked off with four keynote pre- 
sentations illustrating some important current drawbacks of MS-based 
data analysis and the crucial role of bioinformatics in solving these 
outstanding issues. 

Prof. dr. Lennart Martens of Ghent University, Belgium, opened the 
meeting by describing his vision on the role of a bioinformatics scientist 
as a “researcher-developer”. As life sciences research has accelerated 
enormously over the past two decades, nowadays it is heavily domi- 
nated by the huge amount of data that are generated and the advanced 
algorithmic techniques that are necessary to analyze these data. He 
outlined that the job of a researcher-developer is to use and develop 
sophisticated algorithms and powerful tools to increase our under- 
standing of the sheer complexity of biological systems [5]. This was 
followed by an interactive discussion on career aspects and the growth 
path of bioinformatics researchers. 

Next, dr. Frédérique Lisacek of the Swiss Institute of Bioinformatics 
(SIB), Switzerland, presented her work on bridging proteomics and 
glycomics. She described difficulties prohibiting the fully automated 
identification of glycoproteomics data and explained how her group has 
tackled some of these issues. By making use of open modification 
searching, peptides with previously unconsidered post-translational 
modifications (PTMs) could be successfully identified [4]. Next, she 
explained how new computational tools can be used for the analysis of 
glycoproteomics data [3]. 

The third keynote speaker was dr. Laurent Gatto of the University of 
Cambridge, England, who gave a presentation on the ecosystem of 
open-source tools in the R programming language for the analysis of MS 
data [2]. Dr. Gatto showed a historical perspective on how increasingly 
powerful and popular R packages for the analysis of proteomics data 
have been developed. Based on a few use cases he demonstrated how 
several popular packages are related to each other and reinforce each 
other, thereby illustrating the effectiveness of open-source. 

The final keynote speaker was prof. dr. Lukas Käll of the KTH Royal 
Institute of Technology, Sweden. Prof. Käll explained that although the 
characterized analytes in an MS proteomics experiment are peptides, 
researchers are typically interested in their parent proteins instead. As a 
result, protein inference has to be performed to reassemble protein 
sequences from the measured peptide sequence data. Based on simu- 
lated data and a sample of known content, prof. Käll demonstrated the 
effect of different design choices of protein inference algorithms [9]. 
Furthermore, he discussed the protein summarization problem, which 
aims to recreate proteins' relative concentration from peptides' abun- 
dances, and his Diffacto algorithm [14]. 

In addition to these invited scientific keynotes two sponsored pre- 
sentations were given by company representatives. First, Adam 
Tenderholt from Veritomyx presented the PeakInvestigator™ software, 
which helps with deconvoluting and centroiding mass spectra. Second, 
Lyle Burton from SCIEX explained which application programming in- 
terfaces (APIs) they provide and how to use them. He also showed some 
examples of how these APIs are already used in open source and pro- 
prietary projects. 


3. Hackathon 


During the subsequent days of the EuBIC developer's meeting the 
participants split up into small groups to actively develop bioinfor- 
matics applications. Project proposals for the hackathon sessions were 
crowd-sourced in a transparent and open process. Prior to the devel- 
oper's meeting community members could submit project proposals for 
inclusion in the hackathon, which were subsequently evaluated on 
scientific merit and community interest. This resulted in a hackathon 
program consisting of six different projects in three main tracks: 


1. quality control 
2. workflows, protocols, and guidelines 
3. quantification. 


3.1. Quality control 


3.1.1. Dashboard for longitudinal QC monitoring 

During this hackathon session the participants developed a web tool 
for the visualization and analysis of quality control (QC) metrics. Based 
on data in the qcML format [11] an interactive R/Shiny dashboard was 
developed using a microservice architecture. The dashboard includes 
functionality to visualize specific QC metrics longitudinally and per- 
form a robust principal component analysis to detect low-performing 
experiments [12]. 


3.1.2. Data management and instrument performance monitoring 

During this hackathon session the participants added novel func- 
tionality to assess the quality of an MS experiment to the Proline-MS- 
Angel proteomics management software system. First, an execution 
environment to run external scripts was added to MS-Angel to extract 
QC metrics from experimental raw files. Second, a semi-supervised 
approach to discriminate between high-quality and low-quality ex- 
periments was implemented in MS-Angel [8]. Third, the session parti- 
cipants established a roadmap to implement further QC features to 
Proline and MS-Angel. 


3.2. Workflows, protocols, and guidelines 


3.2.1. Implementation of software protocols in computational proteomics 

During this hackathon session the participants created a framework 
to implement fully documented and interactive protocols describing 
how to successfully carry out popular workflows to analyze MS data. 
Controlled environments in which to perform specific tasks were cre- 
ated using Docker containers and Jupyter notebooks to allow the full 
reproducibility of analysis pipelines and workflows. 


3.2.2. Third-party tool integration and method development in OpenMS 

The participants of this session first got an introduction to the 
OpenMS software platform [7]. Afterwards they developed their own 
plugins under the guidance of experienced OpenMS maintainers. Ex- 
amples of new OpenMS plugins that were developed include the 
MaRaCluster algorithm for spectral clustering [10]. 


3.3. Quantification 


3.3.1. Statistical modelling to improve the quantitative analysis of post- 
translationally modified peptides 

Using a recent phosphoproteomics dataset [6], the participants of 
this session evaluated three strategies for the differential analysis of 
PTMs: 


1. based on modified peptides only 

2. based on modified peptides and any unmodified peptides from the 
corresponding protein 

3. based on modified peptides, their unmodified counterparts, and any 
other unmodified peptides from the corresponding protein. 


For each of these three cases linear models were developed to de- 
scribe the quantification of modified peptides under different condi- 
tions. 


3.3.2. Novel algorithms for DIA-based label-free quantification 

During this hackathon session the participants created new algo- 
rithms for label-free quantification of data-independent acquisition 
datasets to be included in IsoQuant [1]. A density-based clustering 
approach was developed to group corresponding features across the 
retention time, mass, and drift time dimensions. 
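
To illustrate the idea of density-based feature grouping, the following is a minimal sketch and not the algorithm developed for IsoQuant, whose details are not given here. It assumes DBSCAN as a representative density-based method, and the function names, tolerances, and example features are hypothetical. Each feature is a (retention time, mass, drift time) triple that is scaled by a per-dimension tolerance, so that a single distance threshold applies to all three dimensions.

```python
import math

NOISE = -1  # label for features that belong to no dense group


def scale(features, tolerances):
    """Divide each dimension by its tolerance so that eps = 1.0 means
    'within tolerance' in retention time, mass, and drift time alike."""
    return [tuple(v / t for v, t in zip(f, tolerances)) for f in features]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label (or NOISE) per point."""
    labels = [None] * len(points)  # None marks an unvisited point

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


# Hypothetical features (retention time in min, mass in Da, drift time in bins)
features = [
    (30.0, 800.400, 50.0), (30.1, 800.402, 50.5), (29.9, 800.399, 49.6),
    (45.0, 1200.600, 70.0), (45.2, 1200.603, 70.4),
    (60.0, 950.500, 20.0),
]
tolerances = (0.5, 0.01, 2.0)  # assumed tolerances per dimension
labels = dbscan(scale(features, tolerances), eps=1.0, min_pts=2)
```

In this toy input, features measured for the same analyte in different runs fall into the same group, while the isolated feature is labelled as noise.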


4. Conclusion and outlook 


The inaugural edition of the EuBIC developer's meeting was a re- 
sounding success. In a follow-up survey all participants expressed their 
overall satisfaction with the meeting, with two thirds of the survey 
respondents giving it a perfect score. Participants especially indicated 
that they enjoyed the unique interactive nature of the hackathon ses- 
sions. As envisioned, the restricted number of attendees allowed many 
interactions and facilitated effective communication and collaboration. 

Even though the EuBIC developer's meeting only ran for a few days, 
significant progress was made during the hackathon sessions on all 
projects. We are encouraged by the productivity of the participants to 
start solving important problems in only a limited time. The hackathon 
groups have committed to continue their collaboration and complete 
their projects, which will hopefully lead to scientific publications and 
ultimately better software solutions for MS-based proteomics end users. 

Encouraged by the enthusiastic support of the community we are 
already planning the next EuBIC Winter School, which will take place in 
January 2019 in Zakopane, Poland. 

Chapter 6 


Discussion 


Identifying peptides from mass spectra is the first step towards investigating 
a biological sample, the mechanisms taking place in a cell, and 
the active proteins. Applying an algorithm that is suited to these demands 
and designed to exploit the capabilities of modern instruments is 
therefore crucial. MS Amanda, the search engine described in this thesis 
(see Chapter 3), has been developed to meet these needs and has 
proven able to cope with the presented challenges. Compared to the 
search engine Mascot [76], which despite its drawbacks is still widely used in 
the proteomics community, MS Amanda is able to identify up to 50% more 
reliable PSMs at the same false discovery rate. This increase is enabled by 
the improved scoring function utilized in MS Amanda. The following three 
elements are key factors in this case: 


1. Calculation of the binomial coefficient: In contrast to other 
search engines that also use a binomial scoring function, such as 
Andromeda [12], MS Amanda defines N, the maximum number of peaks that 
can be matched, as the number of picked peaks in the experimental mass 
spectrum. 


2. Estimation of the probability p to match a peak by chance: The 
formula for the probability of matching a peak by chance has 
been designed to accurately incorporate fragment mass tolerances 
in ppm (parts per million), to account for the capabilities of 
modern mass spectrometers. 


3. Consideration of peak intensities: The intensities of matched 
peaks are incorporated in the scoring function. This allows 
discriminating between peptides matching the same number of peaks and 
favours peptides matching the more intense peaks, as these are more 
relevant for the spectrum. 
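
The interplay of these three elements can be illustrated with a minimal sketch. It is not the MS Amanda scoring function (see Chapter 3 for the actual formula). The sketch assumes a plain binomial tail probability over the N picked peaks, a chance probability p derived from the share of the m/z range covered by the ppm tolerance windows, and a simple weighting by the explained intensity fraction. All function names and example values are hypothetical.

```python
import math


def ppm_to_da(mz, tol_ppm):
    """Absolute half-width (in m/z units) of a ppm tolerance at a given m/z."""
    return mz * tol_ppm * 1e-6


def chance_probability(theoretical_mz, tol_ppm, mz_min, mz_max):
    """Assumed estimate of p: share of the m/z range that is covered by the
    tolerance windows around all theoretical fragment ions."""
    covered = sum(2.0 * ppm_to_da(mz, tol_ppm) for mz in theoretical_mz)
    return min(1.0, covered / (mz_max - mz_min))


def binomial_tail(n_trials, k, p):
    """P(X >= k) for X ~ Binomial(n_trials, p)."""
    return sum(math.comb(n_trials, i) * p**i * (1.0 - p)**(n_trials - i)
               for i in range(k, n_trials + 1))


def score(picked_peaks, theoretical_mz, tol_ppm, mz_min, mz_max):
    """Illustrative score: -10 log10 of the chance of matching at least n of
    the N picked peaks, weighted by the explained intensity fraction."""
    matched = [(mz, inten) for mz, inten in picked_peaks
               if any(abs(mz - t) <= ppm_to_da(t, tol_ppm)
                      for t in theoretical_mz)]
    if not matched:
        return 0.0
    n_picked = len(picked_peaks)  # N: number of picked peaks
    p = chance_probability(theoretical_mz, tol_ppm, mz_min, mz_max)
    tail = max(binomial_tail(n_picked, len(matched), p), 1e-300)
    explained = (sum(inten for _, inten in matched)
                 / sum(inten for _, inten in picked_peaks))
    return -10.0 * math.log10(tail / explained)


# Two hypothetical candidates matching the same number of picked peaks
picked = [(200.0, 10.0), (300.0, 100.0), (400.0, 50.0), (500.0, 5.0)]
score_intense = score(picked, [300.0, 400.0, 650.0], 10.0, 100.0, 1000.0)
score_weak = score(picked, [200.0, 500.0, 650.0], 10.0, 100.0, 1000.0)
```

Both candidates match two of the four picked peaks with the same chance probability, so only the intensity term separates them, and the candidate explaining the more intense peaks receives the higher score.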


MS Amanda has been the basis for further developments. Despite the 
achievements of this new algorithm in terms of spectrum identification, many 
spectra still lacked a confident peptide assignment, and a considerable 
number of identified spectra still contained high-intensity peaks that 
remained unexplained. These issues indicated the presence of chimeric 
spectra, which had already been observed before [3]. The existing 
solutions capable of identifying co-eluting precursors [94] [79] were 
hardly used in everyday proteomics workflows, although the potential of 
this information is tremendous and it is easily retrievable. To ease access to 
and usage of chimeric spectra identification, new algorithms have been 
developed for a chimeric search functionality. Several aspects and strategies 
have been considered and tested; the following mechanism has proven to be 
most successful (see also Chapter 4): 


• First search: A first peptide identification search is performed using 
the specified precursor mass and corresponding PSMs are stored and 
reported. 


• Spectrum cleaning: Spectra are cleaned for peaks already identified 
by the best matching peptide of the first search. The overlap of 
shared peaks of peptides from co-eluting precursors has been proven 
to be negligible. 


• Co-eluting precursor identification: MS1 spectra are investigated 
and potential co-eluting precursors within the isolation window are 
identified. 


• Second search: For every potential identified co-eluting precursor 
the cleaned spectrum is cloned and the corresponding precursor is 
assigned. Spectra are searched again using the MS Amanda algorithm. 
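
The data flow between spectrum cleaning, co-eluting precursor identification, and the second search can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described in Chapter 4. The `Spectrum` type, the function names, and the intensity threshold are hypothetical, the search engine itself is left out, and isotope patterns and charge states are ignored.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Spectrum:
    precursor_mz: float
    peaks: tuple  # ((mz, intensity), ...) of the MS2 spectrum


def clean_spectrum(spectrum, explained_mz, tol_ppm):
    """Remove all peaks explained by the best first-search peptide."""
    def is_explained(mz):
        return any(abs(mz - e) <= e * tol_ppm * 1e-6 for e in explained_mz)
    kept = tuple(p for p in spectrum.peaks if not is_explained(p[0]))
    return replace(spectrum, peaks=kept)


def coeluting_precursors(ms1_peaks, spectrum, isolation_width,
                         tol_ppm, min_intensity):
    """Return m/z values of MS1 peaks inside the isolation window that are
    intense enough and differ from the originally selected precursor."""
    half = isolation_width / 2.0
    low = spectrum.precursor_mz - half
    high = spectrum.precursor_mz + half
    own = spectrum.precursor_mz * tol_ppm * 1e-6
    return [mz for mz, inten in ms1_peaks
            if low <= mz <= high
            and inten >= min_intensity
            and abs(mz - spectrum.precursor_mz) > own]


def second_search_inputs(cleaned, precursors):
    """Clone the cleaned spectrum once per co-eluting precursor."""
    return [replace(cleaned, precursor_mz=mz) for mz in precursors]


# Hypothetical example: one fragment peak was explained in the first search
spectrum = Spectrum(500.0, ((150.0, 10.0), (250.0, 20.0), (350.0, 30.0)))
cleaned = clean_spectrum(spectrum, explained_mz=[250.0], tol_ppm=10.0)
ms1 = [(499.2, 1000.0), (500.0, 5000.0), (500.5, 10.0),
       (500.9, 800.0), (503.0, 900.0)]
candidates = coeluting_precursors(ms1, spectrum, isolation_width=2.0,
                                  tol_ppm=10.0, min_intensity=100.0)
clones = second_search_inputs(cleaned, candidates)
```

Each clone would then be submitted to the search engine like an ordinary spectrum, which is why the second search needs no changes to the scoring itself.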


Results show that, depending on the instrument settings, up to 50% of all 
spectra carry an additional peptide that can be reliably identified. More 
than 40% additional unique peptides can be identified even for narrow 
isolation windows, increasing to almost 50% for an isolation width of 4 m/z. 
On average, 20% of all spectra even contained more than two peptides. The 
identification of chimeric spectra has proven to be an indispensable task 
when it comes to gaining deep insight into a biological sample. Although the 
underlying approach is computationally challenging, results revealed that it 
is worth investing the time. 

Once tandem mass spectra have been identified, validating the 
matches is of high importance. Several approaches to this task have already 
been pursued, as also shown in Chapter 4 and Section 
using retention time prediction and white box modeling. Similar strategies 
have been followed recently by Granholm et al. or Tu et al. [85]. 

Several spectra still remain unexplained. This may be due to a lack of 
corresponding proteins in the database or, more likely, to unconsidered 
PTMs. We have already conducted research into identifying 
modified spectra prior to the database search [25]; still, there is more work to be done. 
Another very promising approach is the work on spectral library identification 
methods, which we are pursuing further; we have already performed 
some work in this area [18]. This thesis will serve as a perfect foundation 
for further research endeavors. 